Working with SQL in Kedro

This page outlines the various approaches and recommended practices when building a Kedro pipeline with SQL databases.

Key concepts and assumptions

When working with SQL databases in Kedro, it’s helpful to keep the following concepts front of mind.

1. Storage vs. Execution

  • You can treat a SQL database purely as a data sink/source, or

  • you can leverage the database engine’s compute power whenever possible.

2. CRUD Scope

  • Kedro handles the parts of CRUD that do not require a concept of “state,” as summarised in the table below.

  • Kedro’s genesis centred around reproducible science, not operational concepts like backfills.

| Operation | Supported? | Notes |
|-----------|------------|-------|
| Create    | Yes        | Writing new tables (or views) |
| Read      | Yes        | Loading data into DataFrames or Ibis expressions |
| Update    | No         | Requires custom SQL outside Kedro’s DAG; breaks lineage |
| Delete    | No         | Same as Update; avoid for full reproducibility |

3. DataFrame-centric

  • Kedro’s DAG and built-in datasets are optimised around Python DataFrames.

  • Embedding complex SQL scripts reduces visibility into dependencies and hinders reproducibility. See the Delta Lake integration guide for more on handling UPSERT/MERGE-like operations.
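
To make this concrete, here is a minimal sketch of a node and pipeline that operate purely on DataFrames; the catalog, not the node, decides whether inputs and outputs live in a SQL database. The dataset names (`events_data`, `users_data`, `enriched_events`) and the `user_id` column are hypothetical.

```python
import pandas as pd
from kedro.pipeline import node, pipeline


def enrich_events(events: pd.DataFrame, users: pd.DataFrame) -> pd.DataFrame:
    # Pure-Python transformation over DataFrames; no SQL in the node itself
    return events.merge(users, on="user_id", how="left")


# The catalog maps "events_data"/"users_data" to SQL-backed datasets (or files);
# the node code stays the same either way.
sql_pipeline = pipeline(
    [node(enrich_events, inputs=["events_data", "users_data"], outputs="enriched_events")]
)
```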

4. Reproducibility & injection risk

  • Raw SQL strings can bypass Kedro’s topological sorting and introduce SQL-injection vulnerabilities.

  • Prefer testable, Pythonic expression APIs (Ibis, Spark) over raw SQL queries.
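
As a brief illustration, the sketch below contrasts a string-built query with an equivalent Ibis expression; the `events` table and its `date` column are assumed for the example.

```python
from datetime import date

import ibis


# Risky: interpolating values into a SQL string invites injection and hides
# the dependency from Kedro's DAG.
def events_query(cutoff: str) -> str:
    return f"SELECT * FROM events WHERE date > '{cutoff}'"  # avoid this pattern


# Preferred: a lazy, testable expression; the backend binds the value safely
# and compiles the filter into SQL.
def recent_events(events: ibis.Table, cutoff: date) -> ibis.Table:
    return events.filter(events["date"] > cutoff)
```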

Approaches

| Approach   | Status | Setup Complexity   |
|------------|--------|--------------------|
| Pandas SQL | Legacy | Minimal            |
| Spark-JDBC | Legacy | High (JVM/cluster) |
| Ibis       | Modern | Minimal            |


1. Pandas SQL (legacy)

```yaml
events_data:
  type: pandas.SQLQueryDataset
  sql: SELECT * FROM events WHERE date > '2025-01-01'
  credentials: sql_creds
```

  • Mechanism: Wraps pd.read_sql_query / pd.read_sql_table for loading and pd.DataFrame.to_sql for saving (see the sketch after this list).

  • Pros

    • No extra services or JVM

    • Pandas-native

  • Cons

    • Loads full tables in memory

    • Bypasses database optimizations

    • Hard to parameterise/test

    • Potential cost/security on large dumps
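
For intuition, loading events_data is roughly equivalent to the sketch below. This is a simplification: the real dataset builds its SQLAlchemy engine from the credentials entry, and the connection string shown here is hypothetical.

```python
import pandas as pd
from sqlalchemy import create_engine

# Roughly what loading `events_data` does: run the query and pull the full
# result set into a single in-memory DataFrame.
engine = create_engine("postgresql+psycopg2://user:password@host:5432/analytics")
events = pd.read_sql_query("SELECT * FROM events WHERE date > '2025-01-01'", con=engine)

# Saving via pandas.SQLTableDataset is similarly a thin wrapper over to_sql.
events.to_sql("events_filtered", con=engine, if_exists="replace", index=False)
```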


2. Spark-JDBC (legacy)

```yaml
sales_data:
  type: spark.JDBCDataset
  table: sales
  credentials: jdbc_creds
```

  • Mechanism: Spark’s DataFrame API over JDBC. Leverages predicate pushdown so filters/projections occur in-database (see the sketch after this list).

  • Pros

    • Distributed, lazy compute

    • Leverages cluster power

  • Cons

    • JVM + Spark cluster required

    • Overkill for smaller workloads
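
A hedged sketch of a downstream node is shown below: the filter and column selection can be pushed down to the database over JDBC, while the aggregation runs on the Spark cluster. The column names (region, product_id, amount) are illustrative.

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def summarise_sales(sales: DataFrame) -> DataFrame:
    # The filter and column selection can be pushed down to the source
    # database via JDBC; the groupBy/agg then runs in Spark.
    return (
        sales.filter(F.col("region") == "EMEA")
        .select("product_id", "amount")
        .groupBy("product_id")
        .agg(F.sum("amount").alias("total_amount"))
    )
```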


3. Ibis (modern)

```yaml
ibis_orders:
  type: ibis.TableDataset
  backend: snowflake
  credentials: snowflake_creds
  table: orders
  save_args:
    materialized: table
    database: my_db.DEV
```

  • Mechanism: Python expression trees are compiled to SQL at runtime (see the node sketch after the pros and cons below).

  • Key highlights

    • SQLGlot translation across 20+ engines

    • Support for extensions (DuckDB geospatial, PostGIS, Snowflake UDFs and more)

    • Full lazy push-down (filters, joins, aggregates)

    • Pluggable materialization (table/view/CTE)

    • No JVM/Spark overhead

    • In-memory/mock backends for testing

  • Pros

    • Consistent API everywhere

    • Retains engine optimizations

    • Excellent testability & reproducibility

  • Cons

    • Some backends may miss niche SQL operations

    • Migration effort: rewriting legacy pipelines that already work may not be justified; prioritise new pipelines and projects.
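
To illustrate the expression-first style, the sketch below shows a node that builds a lazy Ibis query; nothing executes until the output is materialised by ibis.TableDataset. The status, customer_id, and amount columns are assumed for illustration.

```python
import ibis
from ibis import _


def top_customers(orders: ibis.Table) -> ibis.Table:
    # Builds an expression tree only; the backend compiles it to SQL and
    # executes it in-database when the output dataset is saved.
    return (
        orders.filter(_.status == "COMPLETE")
        .group_by("customer_id")
        .aggregate(total_spend=_.amount.sum())
        .order_by(ibis.desc("total_spend"))
        .limit(100)
    )
```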

Warning

Consult the Ibis support matrix to verify that needed SQL functions are available. You can still write raw SQL for missing features.
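
For instance, many Ibis backends expose an escape hatch that runs raw SQL but still returns an ordinary table expression. The sketch below uses an in-memory DuckDB connection and a made-up orders table purely for illustration.

```python
import ibis
import pandas as pd

con = ibis.duckdb.connect()  # in-memory backend, used here only as a demo
con.create_table(
    "orders",
    pd.DataFrame(
        {"order_date": pd.to_datetime(["2025-01-02", "2025-01-09"]), "amount": [10.0, 20.0]}
    ),
)

# Raw SQL for a feature you cannot express via the Ibis API; the result is a
# normal Ibis table expression that composes with the rest of the pipeline.
weekly = con.sql(
    "SELECT date_trunc('week', order_date) AS week, count(*) AS n FROM orders GROUP BY 1"
)
print(weekly.filter(weekly.n >= 1).execute())
```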

Limitations and key design decisions

More broadly, there are some limitations to be aware of when working with Kedro and SQL:

  • No conditional branching: Kedro does not support conditional nodes, which makes UPSERT logic difficult. Kedro favours reproducibility, preferring append and overwrite modes over in-place updates and deletes.

  • When to seek SQL-first alternatives: If your workflow is entirely SQL, tools like dbt or SQLMesh offer richer lineage and transformation management than Kedro’s Python-centric approach.