Databricks CDC (Change Data Capture)
Change Data Capture (CDC) in Databricks is a pattern for processing only the
changes (inserts, updates, deletes) from source systems instead of repeatedly
reloading full tables. This enables near real-time analytics, more efficient data pipelines, and
easier auditing on the Databricks Lakehouse.
What Is Change Data Capture?
- CDC tracks row-level changes (INSERT, UPDATE, DELETE) in source systems.
- Only the delta (what changed since last read) is processed.
- Supports incremental data pipelines instead of full refreshes (example change events below).
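To make this concrete, here is what a small, normalized stream of change events might look like. The field names (op, event_ts) and values are purely illustrative, not any particular CDC tool's output format:

```python
# Illustrative CDC events; "op" and "event_ts" are hypothetical field names.
cdc_events = [
    {"op": "INSERT", "customer_id": 101, "email": "ada@example.com", "event_ts": "2024-06-01T09:00:00Z"},
    {"op": "UPDATE", "customer_id": 101, "email": "ada@example.org", "event_ts": "2024-06-01T10:15:00Z"},
    {"op": "DELETE", "customer_id": 102, "email": None,              "event_ts": "2024-06-01T11:30:00Z"},
]
```

Only these three rows would be processed downstream, rather than the full customers table.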
Why Use CDC?
- Lower cost: Less data scanned and written.
- Lower latency: Faster delivery of fresh data to downstream consumers.
- Better scalability: Handles large tables and high-change volumes.
- Audit and history: Easier to reconstruct how data changed over time.
CDC in the Databricks Lakehouse
Databricks typically organizes CDC pipelines using the Bronze / Silver / Gold
layering pattern on top of Delta Lake tables:
- Bronze: Raw, ingested CDC events (from logs, queues, or files).
- Silver: Cleaned, merged, de-duplicated tables with current state and history.
- Gold: Business-ready aggregates, marts, and feature tables.
Key Databricks Building Blocks for CDC
- Delta Lake: Transactional storage format supporting ACID and MERGE operations.
- Delta Change Data Feed (CDF): Exposes row-level changes from Delta tables.
- Structured Streaming: For continuous ingestion and processing of change events.
- Auto Loader: Incrementally processes new files from cloud storage.
- Delta Live Tables (DLT): Declarative ETL framework for building CDC-aware pipelines.
Typical CDC Architectures on Databricks
1. Log-Based CDC Into Databricks
In this pattern, a separate CDC tool reads database logs and pushes changes to Databricks.
- A CDC tool (e.g., Debezium or Fivetran) captures database changes (INSERT, UPDATE, DELETE).
- Changes are written as events to:
- cloud storage (e.g., JSON/Avro/Parquet files), or
- a message bus (e.g., Kafka), which Databricks then reads.
- Databricks Structured Streaming or Auto Loader ingests these events into a Bronze Delta table (see the ingestion sketch after this list).
- Silver tables apply business logic:
- MERGE events into a target Delta table.
- Maintain both current and historical views.
- Gold tables build aggregates and reporting layers.
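As a minimal sketch of the Bronze ingestion step, assuming CDC events land as JSON files in cloud storage and the code runs in a Databricks notebook where spark is predefined (all paths and table names below are hypothetical):

```python
from pyspark.sql import functions as F

# Incrementally pick up new CDC event files with Auto Loader ("cloudFiles").
bronze_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/cdc/_schemas/orders")  # hypothetical path
    .load("/mnt/cdc/raw/orders")  # landing zone written by the CDC tool
    .withColumn("_ingested_at", F.current_timestamp())
)

# Append the raw events, unmodified, to a Bronze Delta table.
(bronze_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/cdc/_checkpoints/orders_bronze")
    .outputMode("append")
    .toTable("bronze.orders_cdc"))
```

Keeping the Bronze table append-only preserves the raw change history for replay and auditing.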
2. Using Delta Change Data Feed (CDF)
When your source is already a Delta table, you can use Change Data Feed to read only
changed rows instead of scanning the full table.
- Enable CDF on a Delta table (so it records row-level changes).
- Downstream pipelines use CDF to read changes from a given version or timestamp.
- Silver/Gold tables consume only the new changes each run (sketched below).
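A minimal sketch, assuming a Delta table named silver.customers (hypothetical) and a remembered last-processed version:

```python
# Enable the Change Data Feed on an existing Delta table (one-time setting).
spark.sql("""
    ALTER TABLE silver.customers
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read only the rows that changed since a known version (42 is hypothetical);
# CDF adds _change_type, _commit_version, and _commit_timestamp columns.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 42)
    .table("silver.customers")
)
changes.select("customer_id", "_change_type", "_commit_version").show()
```

The same options work with spark.readStream for continuous consumption of changes.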
3. Batch CDC from Snapshots (Pseudo-CDC)
If the source system provides only full snapshots, Databricks can compute changes between
current and previous snapshots.
- Ingest each snapshot into a Bronze Delta table with a snapshot_date.
- Compute differences between the latest and previous snapshots:
- New rows ⇒ inserts
- Changed rows ⇒ updates
- Missing rows ⇒ deletes
- Apply the results as MERGE operations to the Silver table (see the diff sketch below).
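A sketch of the diffing step, assuming a Bronze table bronze.customers_snap keyed by customer_id with a snapshot_date column (all names are hypothetical):

```python
from pyspark.sql import functions as F

snapshots = spark.table("bronze.customers_snap")
latest   = snapshots.where(F.col("snapshot_date") == "2024-06-02")
previous = snapshots.where(F.col("snapshot_date") == "2024-06-01")

key = "customer_id"
inserts = latest.join(previous, key, "left_anti")   # in latest only   => INSERT
deletes = previous.join(latest, key, "left_anti")   # in previous only => DELETE
updates = (                                         # same key, changed payload => UPDATE
    latest.alias("l")
    .join(previous.alias("p"), key)
    .where(F.col("l.email") != F.col("p.email"))    # compare the tracked columns
    .select("l.*")
)
```

Each of these DataFrames can then feed the MERGE pattern described in the next section.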
How CDC Is Applied in Delta Tables
Upserts Using MERGE
A common CDC pattern in Databricks is to MERGE change events into a target Delta table:
- Match on a business key (e.g., customer_id, order_id).
- When a row exists:
- Apply UPDATE or DELETE logic based on the CDC operation type.
- When a row does not exist:
- INSERT it as a new record (see the MERGE sketch below).
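A minimal MERGE sketch using the Delta Lake Python API; the table and column names (silver.customers, bronze.customers_cdc, customer_id, op) are hypothetical and assume the normalized event schema shown earlier:

```python
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "silver.customers")
updates_df = spark.table("bronze.customers_cdc")  # normalized change events

(target.alias("t")
    .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedDelete(condition="s.op = 'DELETE'")          # matched key + DELETE event
    .whenMatchedUpdateAll(condition="s.op = 'UPDATE'")       # matched key + UPDATE event
    .whenNotMatchedInsertAll(condition="s.op != 'DELETE'")   # new key => INSERT
    .execute())
```

In practice the source should first be de-duplicated to one event per key (e.g., the latest by event timestamp), since MERGE requires each target row to match at most one source row.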
Slowly Changing Dimensions (SCD) with CDC
For dimensional models, Databricks CDC is often combined with SCD Type 1 and
SCD Type 2 patterns:
- SCD Type 1: Overwrite old values with new values (no history).
- SCD Type 2: Keep full history with effective from/to dates and a current-row flag (see the sketch below).
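For SCD Type 2, one common two-step sketch first expires the current row, then appends the new version. All names (silver.dim_customer, valid_from, valid_to, is_current) are hypothetical, and the event columns are assumed to align with the dimension schema:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

dim = DeltaTable.forName(spark, "silver.dim_customer")
changes = spark.table("bronze.customers_cdc")

# Step 1: close out the current version of every changed business key.
(dim.alias("t")
    .merge(changes.alias("s"),
           "t.customer_id = s.customer_id AND t.is_current = true")
    .whenMatchedUpdate(set={"is_current": "false", "valid_to": "s.event_ts"})
    .execute())

# Step 2: append the incoming rows as the new current versions.
(changes
    .withColumn("valid_from", F.col("event_ts"))
    .withColumn("valid_to", F.lit(None).cast("timestamp"))
    .withColumn("is_current", F.lit(True))
    .drop("op", "event_ts")  # drop CDC metadata not stored in the dimension
    .write.format("delta").mode("append").saveAsTable("silver.dim_customer"))
```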
CDC with Delta Live Tables (DLT)
Delta Live Tables can simplify CDC implementations by managing dependencies, ordering, and
fault tolerance for you (a sketch follows the list below).
- Define Bronze, Silver, and Gold tables declaratively.
- Use streaming or triggered modes to process new CDC data automatically.
- Leverage built-in expectations (data quality rules) to validate change records.
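A sketch of such a pipeline using DLT's APPLY CHANGES API (apply_changes in Python); the source path, keys, and column names are hypothetical, and the code runs inside a DLT pipeline rather than a plain notebook:

```python
import dlt
from pyspark.sql import functions as F

# Bronze: stream raw CDC event files with Auto Loader.
@dlt.view
def customers_cdc_raw():
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/cdc/raw/customers"))  # hypothetical landing zone

# Silver: let DLT apply inserts/updates/deletes in the correct order.
dlt.create_streaming_table("customers_silver")

dlt.apply_changes(
    target="customers_silver",
    source="customers_cdc_raw",
    keys=["customer_id"],
    sequence_by=F.col("event_ts"),            # ordering column for out-of-order events
    apply_as_deletes=F.expr("op = 'DELETE'"),
    stored_as_scd_type=1,                     # 2 would keep full history instead
)
```

This replaces the hand-written MERGE and de-duplication logic from the earlier sections with a declarative specification.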
Best Practices for Databricks CDC
- Use Delta Lake for all CDC tables (Bronze, Silver, Gold).
- Partition tables by low-cardinality, time-based columns (e.g., an ingest date) to optimize incremental reads.
- Store immutable change events in Bronze; do not overwrite raw CDC.
- Normalize CDC event schema (operation type, timestamps, source metadata).
- Monitor pipeline health (lag, error rates, schema drift) in Databricks jobs or DLT.
- Plan retention for historical CDC data to balance cost vs. audit requirements.
Summary
- Databricks CDC focuses on processing only changed data instead of full reloads.
- It uses Delta Lake, MERGE, and optionally CDF and DLT.
- Typical architecture uses Bronze/Silver/Gold layers for raw, curated, and business data.
- Proper CDC design improves freshness, performance, and cost efficiency of data pipelines on Databricks.