Working with SQL in Kedro

This page outlines the various approaches and recommended practices when building a Kedro pipeline with SQL databases.

Key concepts and assumptions

When working with SQL databases in Kedro, it’s helpful to keep the following concepts front of mind.

1. Storage vs. Execution

  • You can treat a SQL database purely as a data sink/source, or

  • you can leverage the database engine’s compute power whenever possible.

2. CRUD Scope

  • Kedro handles the parts of CRUD that do not require a concept of “state,” as summarised in the table below.

  • Kedro’s genesis centred around reproducible science, not operational concepts like backfills.

| Operation | Supported? | Notes |
|-----------|------------|-------|
| Create    | Yes        | Writing new tables (or views) |
| Read      | Yes        | Loading data into DataFrames or Ibis expressions |
| Update    | No         | Requires custom SQL outside Kedro’s DAG; breaks lineage |
| Delete    | No         | Same as Update; avoid for full reproducibility |

3. DataFrame-centric

  • Kedro’s DAG and built-in datasets are optimised around Python DataFrames.

  • Embedding complex SQL scripts reduces visibility into dependencies and hinders reproducibility. See the Delta Lake integration guide for more on handling UPSERT/MERGE-like operations.
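
To make this concrete, here is a minimal sketch of a node and pipeline that operate purely on DataFrames; the catalog, not the node, decides whether inputs and outputs live in a SQL database. The dataset names (`events_data`, `users_data`, `enriched_events`) and the `user_id` column are hypothetical.

```python
import pandas as pd
from kedro.pipeline import node, pipeline


def enrich_events(events: pd.DataFrame, users: pd.DataFrame) -> pd.DataFrame:
    # Pure-Python transformation over DataFrames; no SQL in the node itself
    return events.merge(users, on="user_id", how="left")


# The catalog maps "events_data"/"users_data" to SQL-backed datasets (or files);
# the node code stays the same either way.
sql_pipeline = pipeline(
    [node(enrich_events, inputs=["events_data", "users_data"], outputs="enriched_events")]
)
```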

4. Reproducibility & injection risk

  • Raw SQL strings can bypass Kedro’s topological sorting and introduce SQL-injection vulnerabilities.

  • Prefer testable, Pythonic expression APIs (Ibis, Spark) over raw SQL queries.
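
As a brief illustration, the sketch below contrasts a string-built query with an equivalent Ibis expression; the `events` table and its `date` column are assumed for the example.

```python
from datetime import date

import ibis


# Risky: interpolating values into a SQL string invites injection and hides
# the dependency from Kedro's DAG.
def events_query(cutoff: str) -> str:
    return f"SELECT * FROM events WHERE date > '{cutoff}'"  # avoid this pattern


# Preferred: a lazy, testable expression; the backend binds the value safely
# and compiles the filter into SQL.
def recent_events(events: ibis.Table, cutoff: date) -> ibis.Table:
    return events.filter(events["date"] > cutoff)
```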

Approaches

| Approach   | Status | Setup Complexity   |
|------------|--------|--------------------|
| Pandas SQL | Legacy | Minimal            |
| Spark-JDBC | Legacy | High (JVM/cluster) |
| Ibis       | Modern | Minimal            |


1. Pandas SQL (legacy)

```yaml
events_data:
  type: pandas.SQLQueryDataset
  sql: SELECT * FROM events WHERE date > '2025-01-01'
  credentials: sql_creds
```

  • Mechanism: Wraps pd.read_sql_query / pd.read_sql_table for loading and pd.DataFrame.to_sql for saving (see the sketch after this list).

  • Pros

    • No extra services or JVM

    • Pandas-native

  • Cons

    • Loads full tables in memory

    • Bypasses database optimizations

    • Hard to parameterise/test

    • Potential cost/security on large dumps
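
For intuition, loading events_data is roughly equivalent to the sketch below. This is a simplification: the real dataset builds its SQLAlchemy engine from the credentials entry, and the connection string shown here is hypothetical.

```python
import pandas as pd
from sqlalchemy import create_engine

# Roughly what loading `events_data` does: run the query and pull the full
# result set into a single in-memory DataFrame.
engine = create_engine("postgresql+psycopg2://user:password@host:5432/analytics")
events = pd.read_sql_query("SELECT * FROM events WHERE date > '2025-01-01'", con=engine)

# Saving via pandas.SQLTableDataset is similarly a thin wrapper over to_sql.
events.to_sql("events_filtered", con=engine, if_exists="replace", index=False)
```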


2. Spark-JDBC (legacy)

```yaml
sales_data:
  type: spark.JDBCDataset
  table: sales
  credentials: jdbc_creds
```

  • Mechanism: Spark’s DataFrame API over JDBC. Leverages predicate pushdown so filters/projections occur in-database (see the sketch after this list).

  • Pros

    • Distributed, lazy compute

    • Leverages cluster power

  • Cons

    • JVM + Spark cluster required

    • Overkill for smaller workloads
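
A hedged sketch of a downstream node is shown below: the filter and column selection can be pushed down to the database over JDBC, while the aggregation runs on the Spark cluster. The column names (region, product_id, amount) are illustrative.

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def summarise_sales(sales: DataFrame) -> DataFrame:
    # The filter and column selection can be pushed down to the source
    # database via JDBC; the groupBy/agg then runs in Spark.
    return (
        sales.filter(F.col("region") == "EMEA")
        .select("product_id", "amount")
        .groupBy("product_id")
        .agg(F.sum("amount").alias("total_amount"))
    )
```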


3. Ibis (modern)

```yaml
ibis_orders:
  type: ibis.TableDataset
  backend: snowflake
  credentials: snowflake_creds
  table: orders
  save_args:
    materialized: table
    database: my_db.DEV
```

  • Mechanism: Python expression trees are compiled to SQL at runtime (see the node sketch after the pros and cons below).

  • Key highlights

    • SQLGlot translation across 20+ engines

    • Support for extensions (DuckDB geospatial, PostGIS, Snowflake UDFs and more)

    • Full lazy push-down (filters, joins, aggregates)

    • Pluggable materialization (table/view/CTE)

    • No JVM/Spark overhead

    • In-memory/mock backends for testing

  • Pros

    • Consistent API everywhere

    • Retains engine optimizations

    • Excellent testability & reproducibility

  • Cons

    • Some backends may miss niche SQL operations

    • Migration effort: rewriting legacy pipelines that already work may not be justified; prioritise new pipelines and projects.
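
To illustrate the expression-first style, the sketch below shows a node that builds a lazy Ibis query; nothing executes until the output is materialised by ibis.TableDataset. The status, customer_id, and amount columns are assumed for illustration.

```python
import ibis
from ibis import _


def top_customers(orders: ibis.Table) -> ibis.Table:
    # Builds an expression tree only; the backend compiles it to SQL and
    # executes it in-database when the output dataset is saved.
    return (
        orders.filter(_.status == "COMPLETE")
        .group_by("customer_id")
        .aggregate(total_spend=_.amount.sum())
        .order_by(ibis.desc("total_spend"))
        .limit(100)
    )
```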

Warning

Consult the Ibis support matrix to verify that needed SQL functions are available. You can still write raw SQL for missing features.
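
For instance, many Ibis backends expose an escape hatch that runs raw SQL but still returns an ordinary table expression. The sketch below uses an in-memory DuckDB connection and a made-up orders table purely for illustration.

```python
import ibis
import pandas as pd

con = ibis.duckdb.connect()  # in-memory backend, used here only as a demo
con.create_table(
    "orders",
    pd.DataFrame(
        {"order_date": pd.to_datetime(["2025-01-02", "2025-01-09"]), "amount": [10.0, 20.0]}
    ),
)

# Raw SQL for a feature you cannot express via the Ibis API; the result is a
# normal Ibis table expression that composes with the rest of the pipeline.
weekly = con.sql(
    "SELECT date_trunc('week', order_date) AS week, count(*) AS n FROM orders GROUP BY 1"
)
print(weekly.filter(weekly.n >= 1).execute())
```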

Limitations and key design decisions

More broadly, there are some limitations to be aware of when working with Kedro and SQL:

  • No conditional branching: Kedro does not support conditional nodes, which makes UPSERT logic difficult. Kedro favours reproducibility, preferring append and overwrite modes over in-place updates and deletes.

  • When to seek SQL-first alternatives: If your workflow is entirely SQL, tools like dbt or SQLMesh offer richer lineage and transformation management than Kedro’s Python-centric approach.