Databricks CDC (Change Data Capture)
Change Data Capture (CDC) in Databricks is a pattern for processing only the
changes (inserts, updates, deletes) from source systems instead of repeatedly
reloading full tables. This enables near real-time analytics, more efficient data pipelines, and
easier auditing on the Databricks Lakehouse.
What Is Change Data Capture?
- CDC tracks row-level changes (INSERT, UPDATE, DELETE) in source systems.
- Only the delta (what changed since last read) is processed.
- Supports incremental data pipelines instead of full refreshes (example change events below).
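To make this concrete, here is what a small, normalized stream of change events might look like. The field names (op, event_ts) and values are purely illustrative, not any particular CDC tool's output format:

```python
# Illustrative CDC events; "op" and "event_ts" are hypothetical field names.
cdc_events = [
    {"op": "INSERT", "customer_id": 101, "email": "ada@example.com", "event_ts": "2024-06-01T09:00:00Z"},
    {"op": "UPDATE", "customer_id": 101, "email": "ada@example.org", "event_ts": "2024-06-01T10:15:00Z"},
    {"op": "DELETE", "customer_id": 102, "email": None,              "event_ts": "2024-06-01T11:30:00Z"},
]
```

Only these three rows would be processed downstream, rather than the full customers table.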
Why Use CDC?
- Lower cost: Less data scanned and written.
- Lower latency: Faster delivery of fresh data to downstream consumers.
- Better scalability: Handles large tables and high-change volumes.
- Audit and history: Easier to reconstruct how data changed over time.
CDC in the Databricks Lakehouse
Databricks typically organizes CDC pipelines using the Bronze / Silver / Gold
layering pattern on top of Delta Lake tables:
- Bronze: Raw, ingested CDC events (from logs, queues, or files).
- Silver: Cleaned, merged, de-duplicated tables with current state and history.
- Gold: Business-ready aggregates, marts, and feature tables.
Key Databricks Building Blocks for CDC
- Delta Lake: Transactional storage format supporting ACID and MERGE operations.
- Delta Change Data Feed (CDF): Exposes row-level changes from Delta tables.
- Structured Streaming: For continuous ingestion and processing of change events.
- Auto Loader: Incrementally processes new files from cloud storage.
- Delta Live Tables (DLT): Declarative ETL framework for building CDC-aware pipelines.
Typical CDC Architectures on Databricks
1. Log-Based CDC Into Databricks
In this pattern, a separate CDC tool reads database logs and pushes changes to Databricks.
- A CDC tool (e.g., Debezium or Fivetran) captures database changes (INSERT, UPDATE, DELETE).
- Changes are written as events to:
- cloud storage (e.g., JSON/Avro/Parquet files), or
- a message bus (e.g., Kafka), which Databricks then reads.
- Databricks Structured Streaming or Auto Loader ingests these events into a Bronze Delta table (see the ingestion sketch after this list).
- Silver tables apply business logic:
- MERGE events into a target Delta table.
- Maintain both current and historical views.
- Gold tables build aggregates and reporting layers.
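As a minimal sketch of the Bronze ingestion step, assuming CDC events land as JSON files in cloud storage and the code runs in a Databricks notebook where spark is predefined (all paths and table names below are hypothetical):

```python
from pyspark.sql import functions as F

# Incrementally pick up new CDC event files with Auto Loader ("cloudFiles").
bronze_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/cdc/_schemas/orders")  # hypothetical path
    .load("/mnt/cdc/raw/orders")  # landing zone written by the CDC tool
    .withColumn("_ingested_at", F.current_timestamp())
)

# Append the raw events, unmodified, to a Bronze Delta table.
(bronze_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/cdc/_checkpoints/orders_bronze")
    .outputMode("append")
    .toTable("bronze.orders_cdc"))
```

Keeping the Bronze table append-only preserves the raw change history for replay and auditing.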
2. Using Delta Change Data Feed (CDF)
When your source is already a Delta table, you can use Change Data Feed to read only
changed rows instead of scanning the full table.
- Enable CDF on a Delta table (so it records row-level changes).
- Downstream pipelines use CDF to read changes from a given version or timestamp.
- Silver/Gold tables consume only the new changes each run (sketched below).
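A minimal sketch, assuming a Delta table named silver.customers (hypothetical) and a remembered last-processed version:

```python
# Enable the Change Data Feed on an existing Delta table (one-time setting).
spark.sql("""
    ALTER TABLE silver.customers
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read only the rows that changed since a known version (42 is hypothetical);
# CDF adds _change_type, _commit_version, and _commit_timestamp columns.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 42)
    .table("silver.customers")
)
changes.select("customer_id", "_change_type", "_commit_version").show()
```

The same options work with spark.readStream for continuous consumption of changes.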
3. Batch CDC from Snapshots (Pseudo-CDC)
If the source system provides only full snapshots, Databricks can compute changes between
current and previous snapshots.
- Ingest each snapshot into a Bronze Delta table with a snapshot_date.
- Compute differences between the latest and previous snapshots:
- New rows ⇒ inserts
- Changed rows ⇒ updates
- Missing rows ⇒ deletes
- Apply the results as MERGE operations to the Silver table (see the diff sketch below).
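A sketch of the diffing step, assuming a Bronze table bronze.customers_snap keyed by customer_id with a snapshot_date column (all names are hypothetical):

```python
from pyspark.sql import functions as F

snapshots = spark.table("bronze.customers_snap")
latest   = snapshots.where(F.col("snapshot_date") == "2024-06-02")
previous = snapshots.where(F.col("snapshot_date") == "2024-06-01")

key = "customer_id"
inserts = latest.join(previous, key, "left_anti")   # in latest only   => INSERT
deletes = previous.join(latest, key, "left_anti")   # in previous only => DELETE
updates = (                                         # same key, changed payload => UPDATE
    latest.alias("l")
    .join(previous.alias("p"), key)
    .where(F.col("l.email") != F.col("p.email"))    # compare the tracked columns
    .select("l.*")
)
```

Each of these DataFrames can then feed the MERGE pattern described in the next section.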
How CDC Is Applied in Delta Tables
Upserts Using MERGE
A common CDC pattern in Databricks is to MERGE change events into a target Delta table:
- Match on a business key (e.g., customer_id, order_id).
- When a row exists:
- Apply UPDATE or DELETE logic based on the CDC operation type.
- When a row does not exist:
- INSERT it as a new record (see the MERGE sketch below).
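A minimal MERGE sketch using the Delta Lake Python API; the table and column names (silver.customers, bronze.customers_cdc, customer_id, op) are hypothetical and assume the normalized event schema shown earlier:

```python
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "silver.customers")
updates_df = spark.table("bronze.customers_cdc")  # normalized change events

(target.alias("t")
    .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedDelete(condition="s.op = 'DELETE'")          # matched key + DELETE event
    .whenMatchedUpdateAll(condition="s.op = 'UPDATE'")       # matched key + UPDATE event
    .whenNotMatchedInsertAll(condition="s.op != 'DELETE'")   # new key => INSERT
    .execute())
```

In practice the source should first be de-duplicated to one event per key (e.g., the latest by event timestamp), since MERGE requires each target row to match at most one source row.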
Slowly Changing Dimensions (SCD) with CDC
For dimensional models, Databricks CDC is often combined with SCD Type 1 and
SCD Type 2 patterns:
- SCD Type 1: Overwrite old values with new values (no history).
- SCD Type 2: Keep full history with effective from/to dates and a current-row flag (see the sketch below).
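For SCD Type 2, one common two-step sketch first expires the current row, then appends the new version. All names (silver.dim_customer, valid_from, valid_to, is_current) are hypothetical, and the event columns are assumed to align with the dimension schema:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

dim = DeltaTable.forName(spark, "silver.dim_customer")
changes = spark.table("bronze.customers_cdc")

# Step 1: close out the current version of every changed business key.
(dim.alias("t")
    .merge(changes.alias("s"),
           "t.customer_id = s.customer_id AND t.is_current = true")
    .whenMatchedUpdate(set={"is_current": "false", "valid_to": "s.event_ts"})
    .execute())

# Step 2: append the incoming rows as the new current versions.
(changes
    .withColumn("valid_from", F.col("event_ts"))
    .withColumn("valid_to", F.lit(None).cast("timestamp"))
    .withColumn("is_current", F.lit(True))
    .drop("op", "event_ts")  # drop CDC metadata not stored in the dimension
    .write.format("delta").mode("append").saveAsTable("silver.dim_customer"))
```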
CDC with Delta Live Tables (DLT)
Delta Live Tables can simplify CDC implementations by managing dependencies, ordering, and
fault tolerance for you (a sketch follows the list below).
- Define Bronze, Silver, and Gold tables declaratively.
- Use streaming or triggered modes to process new CDC data automatically.
- Leverage built-in expectations (data quality rules) to validate change records.
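A sketch of such a pipeline using DLT's APPLY CHANGES API (apply_changes in Python); the source path, keys, and column names are hypothetical, and the code runs inside a DLT pipeline rather than a plain notebook:

```python
import dlt
from pyspark.sql import functions as F

# Bronze: stream raw CDC event files with Auto Loader.
@dlt.view
def customers_cdc_raw():
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/cdc/raw/customers"))  # hypothetical landing zone

# Silver: let DLT apply inserts/updates/deletes in the correct order.
dlt.create_streaming_table("customers_silver")

dlt.apply_changes(
    target="customers_silver",
    source="customers_cdc_raw",
    keys=["customer_id"],
    sequence_by=F.col("event_ts"),            # ordering column for out-of-order events
    apply_as_deletes=F.expr("op = 'DELETE'"),
    stored_as_scd_type=1,                     # 2 would keep full history instead
)
```

This replaces the hand-written MERGE and de-duplication logic from the earlier sections with a declarative specification.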
Best Practices for Databricks CDC
- Use Delta Lake for all CDC tables (Bronze, Silver, Gold).
- Partition tables by low-cardinality, time-based columns (e.g., an ingest date) to optimize incremental reads.
- Store immutable change events in Bronze; do not overwrite raw CDC.
- Normalize CDC event schema (operation type, timestamps, source metadata).
- Monitor pipeline health (lag, error rates, schema drift) in Databricks jobs or DLT.
- Plan retention for historical CDC data to balance cost vs. audit requirements.
Summary
- Databricks CDC focuses on processing only changed data instead of full reloads.
- It uses Delta Lake, MERGE, and optionally CDF and DLT.
- Typical architecture uses Bronze/Silver/Gold layers for raw, curated, and business data.
- Proper CDC design improves freshness, performance, and cost efficiency of data pipelines on Databricks.