Delta Live Tables (DLT) in Databricks is a declarative framework for building reliable, scalable data pipelines. Built on top of Delta Lake, it simplifies creating, managing, and monitoring those pipelines.
import dlt
from pyspark.sql.functions import col

# A single DLT table: the function name becomes the table name
@dlt.table
def clean_data():
    return spark.read.load("path/to/raw_data").filter(col("age") > 18)
# The same table with an expectation attached on top of @dlt.table
@dlt.table
@dlt.expect("valid_age", "age > 0")
def clean_data():
    return spark.read.load("path/to/raw_data").filter(col("age") > 18)
import dlt
from pyspark.sql.functions import col, count

# Ingest raw data
@dlt.table
def raw_data():
    return spark.read.load("path/to/raw_data")

# Clean data by filtering invalid rows
@dlt.table
def clean_data():
    return dlt.read("raw_data").filter(col("age") > 18)

# Aggregate the cleaned data
@dlt.table
def aggregated_data():
    return dlt.read("clean_data").groupBy("country").agg(count("*").alias("user_count"))
An imperative job specifies the steps: read source, run these transformations, write here, in this order. DLT is declarative — you define each table as a function returning a DataFrame, and the framework infers the dependency DAG, schedules the steps, retries failures, and tracks lineage. The win is operational: the same pipeline that took hundreds of lines of orchestration code in Airflow becomes a handful of decorated Python functions, and DLT owns the DAG instead of the developer.
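For contrast, a minimal sketch of the same three steps written imperatively as a plain Spark job (the output paths are placeholders): the developer decides the order, writes out each intermediate table, and owns retries and monitoring.

from pyspark.sql.functions import col, count

# Each step, its ordering, and the intermediate writes are spelled out by hand
raw = spark.read.load("path/to/raw_data")

clean = raw.filter(col("age") > 18)
clean.write.format("delta").mode("overwrite").save("path/to/clean_data")

aggregated = clean.groupBy("country").agg(count("*").alias("user_count"))
aggregated.write.format("delta").mode("overwrite").save("path/to/aggregated_data")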
Expectations are declarative data quality rules attached to a table. @dlt.expect("rule", "predicate") records violations to the event log but lets bad rows through. @dlt.expect_or_drop drops violating rows from the output while still recording metrics. @dlt.expect_or_fail aborts the pipeline update if any row violates. Use expect for monitoring, expect_or_drop for non-critical attributes, and expect_or_fail for invariants whose violation you cannot tolerate (e.g., a null primary key).
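As a minimal sketch, the three severities can be stacked on one table definition (the table and column names here are illustrative):

import dlt

@dlt.table
@dlt.expect("valid_age", "age > 0")                          # log violations, keep the rows
@dlt.expect_or_drop("valid_country", "country IS NOT NULL")  # drop violating rows, keep metrics
@dlt.expect_or_fail("valid_id", "user_id IS NOT NULL")       # fail the update on any violation
def users_clean():
    return dlt.read("raw_data")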
dlt.apply_changes (the Python counterpart of SQL's APPLY CHANGES INTO) ingests a stream of inserts, updates, and deletes from a CDC source and applies them to a target Delta table, handling out-of-order events, deduplication, and SCD Type 1 or Type 2 history automatically. You declare the keys, the sequence column (e.g., the source LSN or timestamp), and the apply mode; DLT does the rest. This replaces the hundreds of lines of hand-written MERGE logic that have traditionally been the most error-prone part of a CDC pipeline.
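A sketch of the Python API under assumed names (customers, customers_cdc_feed, customer_id, event_timestamp, and operation are placeholders for your own source and columns):

import dlt
from pyspark.sql.functions import col, expr

# The target streaming table must exist before changes can be applied into it
dlt.create_streaming_table("customers")

dlt.apply_changes(
    target="customers",
    source="customers_cdc_feed",                    # stream of CDC events (inserts/updates/deletes)
    keys=["customer_id"],                           # key columns used to match rows
    sequence_by=col("event_timestamp"),             # orders late or out-of-order events
    apply_as_deletes=expr("operation = 'DELETE'"),  # rows flagged as deletes remove the target row
    stored_as_scd_type=1                            # 1 = keep latest only, 2 = keep full history
)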
DLT fits best when the pipeline is medallion-shaped (Bronze → Silver → Gold with clear table boundaries), when data quality rules need to be visible and tracked, when CDC is involved, and when the team is small enough that owning custom Airflow-style orchestration is a tax. Plain Workflows still win for pipelines that are mostly ML training, mostly Python business logic without intermediate tables, or that need precise control over cluster lifecycle and library installation. As a rule, if your pipeline is mostly DataFrame reads and writes with quality checks, DLT is the cleaner expression.
A streaming live table processes each input row exactly once — it tracks state via Spark Structured Streaming checkpoints and is appropriate for append-mostly Bronze and incremental Silver. A materialized view recomputes the full result on each pipeline run (or incrementally if the query is supported) — it is appropriate for Gold aggregates that need to reflect the latest state of multiple upstream tables. Pick streaming for anything fed by Auto Loader or another stream; pick MV for aggregations and joins that need to see the full upstream table.
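In the Python API both are declared with @dlt.table; the definition becomes a streaming table when it reads from a stream (such as Auto Loader) and a materialized view when it reads in batch. A sketch, with the landing path and column names as placeholders:

import dlt
from pyspark.sql.functions import count

# Streaming table: Auto Loader ingest, each input file is processed exactly once
@dlt.table
def events_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("path/to/landing_zone")
    )

# Materialized view: recomputed (or incrementally refreshed) aggregate over the full upstream table
@dlt.table
def events_by_country():
    return dlt.read("events_bronze").groupBy("country").agg(count("*").alias("event_count"))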
The pipeline owns its own compute: you cannot share a cluster across multiple DLT pipelines, which typically costs more than packing jobs onto a shared all-purpose cluster. The runtime is Databricks-specific, so the same code does not run on a vanilla Spark cluster. Cluster customization is constrained: certain init scripts and libraries that work on regular Spark may not be supported. And DLT is billed at its own DBU rates, which are higher than standard jobs compute and matter at scale. For most data engineering teams these tradeoffs are worth the operational simplification, but not for every workload.