Delta Live Tables (DLT) in Databricks is a declarative framework for building reliable, scalable data pipelines. Built on top of Delta Lake, it simplifies creating, managing, and monitoring those pipelines.
import dlt
from pyspark.sql.functions import col

# A single DLT table: the function name becomes the table name
@dlt.table
def clean_data():
    return spark.read.load("path/to/raw_data").filter(col("age") > 18)
# The same table with an expectation attached on top of @dlt.table
@dlt.table
@dlt.expect("valid_age", "age > 0")
def clean_data():
    return spark.read.load("path/to/raw_data").filter(col("age") > 18)
import dlt
from pyspark.sql.functions import col, count

# Ingest raw data
@dlt.table
def raw_data():
    return spark.read.load("path/to/raw_data")

# Clean data by filtering invalid rows
@dlt.table
def clean_data():
    return dlt.read("raw_data").filter(col("age") > 18)

# Aggregate the cleaned data
@dlt.table
def aggregated_data():
    return dlt.read("clean_data").groupBy("country").agg(count("*").alias("user_count"))
An imperative job specifies the steps: read source, run these transformations, write here, in this order. DLT is declarative — you define each table as a function returning a DataFrame, and the framework infers the dependency DAG, schedules the steps, retries failures, and tracks lineage. The win is operational: the same pipeline that took hundreds of lines of orchestration code in Airflow becomes a handful of decorated Python functions, and DLT owns the DAG instead of the developer.
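For contrast, a minimal sketch of the same three steps written imperatively as a plain Spark job (the output paths are placeholders): the developer decides the order, writes out each intermediate table, and owns retries and monitoring.

from pyspark.sql.functions import col, count

# Each step, its ordering, and the intermediate writes are spelled out by hand
raw = spark.read.load("path/to/raw_data")

clean = raw.filter(col("age") > 18)
clean.write.format("delta").mode("overwrite").save("path/to/clean_data")

aggregated = clean.groupBy("country").agg(count("*").alias("user_count"))
aggregated.write.format("delta").mode("overwrite").save("path/to/aggregated_data")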
Expectations are declarative data quality rules attached to a table. @dlt.expect("rule", "predicate") records violations to the event log but lets bad rows through. @dlt.expect_or_drop drops violating rows from the output while still recording metrics. @dlt.expect_or_fail aborts the pipeline update if any row violates. Use expect for monitoring, expect_or_drop for non-critical attributes, and expect_or_fail for invariants whose violation you cannot tolerate (e.g., a null primary key).
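As a minimal sketch, the three severities can be stacked on one table definition (the table and column names here are illustrative):

import dlt

@dlt.table
@dlt.expect("valid_age", "age > 0")                          # log violations, keep the rows
@dlt.expect_or_drop("valid_country", "country IS NOT NULL")  # drop violating rows, keep metrics
@dlt.expect_or_fail("valid_id", "user_id IS NOT NULL")       # fail the update on any violation
def users_clean():
    return dlt.read("raw_data")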
dlt.apply_changes (the Python counterpart of SQL's APPLY CHANGES INTO) ingests a stream of inserts, updates, and deletes from a CDC source and applies them to a target Delta table, handling out-of-order events, deduplication, and SCD Type 1 or Type 2 history automatically. You declare the keys, the sequence column (e.g., the source LSN or timestamp), and the apply mode; DLT does the rest. This replaces the hundreds of lines of hand-written MERGE logic that have traditionally been the most error-prone part of a CDC pipeline.
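A sketch of the Python API under assumed names (customers, customers_cdc_feed, customer_id, event_timestamp, and operation are placeholders for your own source and columns):

import dlt
from pyspark.sql.functions import col, expr

# The target streaming table must exist before changes can be applied into it
dlt.create_streaming_table("customers")

dlt.apply_changes(
    target="customers",
    source="customers_cdc_feed",                    # stream of CDC events (inserts/updates/deletes)
    keys=["customer_id"],                           # key columns used to match rows
    sequence_by=col("event_timestamp"),             # orders late or out-of-order events
    apply_as_deletes=expr("operation = 'DELETE'"),  # rows flagged as deletes remove the target row
    stored_as_scd_type=1                            # 1 = keep latest only, 2 = keep full history
)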
DLT fits best when the pipeline is medallion-shaped (Bronze → Silver → Gold with clear table boundaries), when data quality rules need to be visible and tracked, when CDC is involved, and when the team is small enough that owning custom Airflow-style orchestration is a tax. Plain Workflows still win for pipelines that are mostly ML training, mostly Python business logic without intermediate tables, or that need precise control over cluster lifecycle and library installation. As a rule, if your pipeline is mostly DataFrame reads and writes with quality checks, DLT is the cleaner expression.
A streaming live table processes each input row exactly once — it tracks state via Spark Structured Streaming checkpoints and is appropriate for append-mostly Bronze and incremental Silver. A materialized view recomputes the full result on each pipeline run (or incrementally if the query is supported) — it is appropriate for Gold aggregates that need to reflect the latest state of multiple upstream tables. Pick streaming for anything fed by Auto Loader or another stream; pick MV for aggregations and joins that need to see the full upstream table.
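In the Python API both are declared with @dlt.table; the definition becomes a streaming table when it reads from a stream (such as Auto Loader) and a materialized view when it reads in batch. A sketch, with the landing path and column names as placeholders:

import dlt
from pyspark.sql.functions import count

# Streaming table: Auto Loader ingest, each input file is processed exactly once
@dlt.table
def events_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("path/to/landing_zone")
    )

# Materialized view: recomputed (or incrementally refreshed) aggregate over the full upstream table
@dlt.table
def events_by_country():
    return dlt.read("events_bronze").groupBy("country").agg(count("*").alias("event_count"))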
The pipeline owns its own compute: you cannot share a cluster across multiple DLT pipelines, which typically costs more than packing jobs onto a shared all-purpose cluster. The runtime is Databricks-specific, so the same code does not run on a vanilla Spark cluster. Cluster customization is constrained: certain init scripts and libraries that work on regular Spark may not be supported. And DLT is billed at its own DBU rates, which are higher than standard jobs compute and matter at scale. For most data engineering teams these tradeoffs are worth the operational simplification, but not for every workload.