Liquid Clustering is Delta Lake's modern data layout strategy — the successor to Hive-style partitioning and to Z-Ordering. It addresses two longstanding pain points in Lakehouse engineering: partitioning's brittleness when access patterns change, and Z-Order's expensive, all-or-nothing rewrites. Liquid Clustering is incremental, has no fixed boundaries, and lets you change cluster columns without rebuilding the table.
For most new Delta tables, Liquid Clustering is the right default. For most existing tables that suffer from small-files or partition skew, it is the right migration target.
Hive-style partitioning physically subdivides a table into directories by the values of one or more columns: /orders/year=2025/month=04/day=22/. This works beautifully when queries always filter on the partition columns and the cardinality is moderate. It collapses in three failure modes that show up at scale:
- Changed access patterns. Partitioning by region works until product asks for queries by customer_segment, at which point every query becomes a full scan. Repartitioning is a full rewrite.
- High cardinality. Once partition cardinality goes above a few thousand distinct values, the table shatters into small files and performance degrades.
- Skew. Once any column with a skewed value distribution is used as a partition, a few hot partitions dominate scan and write times.

The fix is usually to stop partitioning and let the data layout be managed automatically.
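For contrast with the Liquid DDL shown later, here is a minimal sketch of the Hive-style layout described above; the table name and derived date columns are illustrative, not taken from a real schema.

# Hypothetical date-partitioned table: this DDL produces the
# /orders/year=.../month=.../day=... directory layout described above.
spark.sql("""
CREATE TABLE main.sales.orders_partitioned (
  order_id    STRING,
  customer_id STRING,
  order_ts    TIMESTAMP,
  amount_usd  DECIMAL(18,2),
  year INT, month INT, day INT
)
USING DELTA
PARTITIONED BY (year, month, day)
""")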
Z-Order was Delta's first-generation answer to "I want fast filters on more than one column without partitioning by all of them." It uses a space-filling curve to colocate rows by multiple columns within each file, then relies on Delta's data-skipping min/max statistics to prune files at query time.
Two structural limitations:
- OPTIMIZE ZORDER BY rewrites every file in scope to apply the curve, which is expensive on large tables.
- New data added since the last OPTIMIZE is unsorted until the next full pass, so clustering quality decays between runs.

Z-Order is still useful in narrow situations (one-shot historical optimization on a static table), but it is no longer the recommended pattern for evolving tables.
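For reference, the first-generation pattern looked like this; a sketch only, run against a table before it is converted to Liquid Clustering (the two are mutually exclusive):

# Legacy Z-Order pass: rewrites every file in scope to apply the curve.
# Data written after this statement stays unsorted until the next run.
spark.sql("OPTIMIZE main.sales.orders ZORDER BY (customer_id, order_ts)")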
Liquid Clustering replaces both partitioning and Z-Order with a single mechanism. The cluster columns are stored in table metadata. Files are grouped into clusters by those columns, and Delta maintains the layout incrementally as data is written or compacted — new files are clustered in place, and only files that need rebalancing are touched. There are no directory boundaries, no skew traps, and no rewrite of the entire table when the layout drifts.
Properties to internalize:
- OPTIMIZE only rewrites files that need it, not the whole table.
- After ALTER TABLE ... CLUSTER BY (...), subsequent writes use the new columns; OPTIMIZE migrates older data over time.
Use CLUSTER BY in CREATE TABLE. Do not also use PARTITIONED BY — the two are mutually exclusive on the same table.
CREATE TABLE main.sales.orders (
order_id STRING NOT NULL,
customer_id STRING NOT NULL,
order_ts TIMESTAMP NOT NULL,
region STRING,
product_sku STRING,
amount_usd DECIMAL(18,2),
status STRING
)
CLUSTER BY (customer_id, order_ts);
Picking customer_id first and order_ts second optimizes for the access pattern "give me everything for customer X in the last 30 days", one of the most common patterns in transactional analytics.
Convert in two steps: declare cluster columns, then run OPTIMIZE to migrate the data layout. The conversion is online — readers and writers continue to operate.
-- Cannot apply CLUSTER BY directly to a partitioned table without first dropping the partition spec.
-- For an unpartitioned table, you can simply alter:
ALTER TABLE main.sales.orders
CLUSTER BY (customer_id, order_ts);
-- Migrate existing files to the new layout incrementally
OPTIMIZE main.sales.orders;
To change cluster columns later, alter the table again. New writes immediately use the new columns; subsequent OPTIMIZE runs migrate older files.
ALTER TABLE main.sales.orders
CLUSTER BY (region, order_ts);
OPTIMIZE main.sales.orders;
To stop clustering entirely (rare):
ALTER TABLE main.sales.orders CLUSTER BY NONE;
# Create a clustered table from a DataFrame using DataFrameWriterV2
(df.writeTo("main.sales.orders")
.using("delta")
.clusterBy("customer_id", "order_ts")
.create())
# Or via the spark.sql shorthand
spark.sql("""
CREATE TABLE main.sales.orders_v2 (
order_id STRING, customer_id STRING, order_ts TIMESTAMP, amount_usd DECIMAL(18,2)
)
CLUSTER BY (customer_id, order_ts)
""")
# Run OPTIMIZE
spark.sql("OPTIMIZE main.sales.orders")
The rule of thumb: pick the columns that appear in WHERE clauses or join keys for the queries you care about. Within that set, prefer:
- High-cardinality identifiers: customer_id in a multi-tenant table; device_id in IoT.
- Event and order timestamps: event_ts, order_ts.

Avoid columns that rarely appear in predicates, and more than four keys in total: four is the limit, and each extra dimension dilutes locality.
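One quick way to vet candidates is to check their cardinality before committing; a minimal sketch against the orders table from the earlier examples (approx_count_distinct is a standard Spark SQL function):

# High-cardinality identifiers and timestamps cluster well;
# near-constant columns like status do not.
spark.sql("""
SELECT
  approx_count_distinct(customer_id) AS customers,
  approx_count_distinct(region)      AS regions,
  approx_count_distinct(status)      AS statuses
FROM main.sales.orders
""").show()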
Liquid Clustering is not "set and forget" — the layout improves only when data is rewritten. There are three ways to drive that:
- Scheduled OPTIMIZE. Schedule it nightly or after large batch loads. Cheap, because it only touches files that need rebalancing.
- Predictive Optimization. Databricks runs OPTIMIZE and VACUUM based on observed query patterns and write activity. Recommended.
- A one-off OPTIMIZE after a large backfill or after changing cluster columns.
-- Enable Predictive Optimization at the schema level (UC managed tables)
ALTER SCHEMA main.sales ENABLE PREDICTIVE OPTIMIZATION;
-- Or schedule via a job
OPTIMIZE main.sales.orders;
An illustrative before/after on a 2 TB orders table clustered by (customer_id, order_ts):
-- Before clustering: full scan
SELECT SUM(amount_usd)
FROM main.sales.orders
WHERE customer_id = 'C-7841'
AND order_ts >= current_date() - INTERVAL 30 DAYS;
-- 2 TB scanned, ~45 s
-- After CLUSTER BY (customer_id, order_ts) + OPTIMIZE
-- Same query: 1.4 GB scanned, ~1.2 s
| Property | Partitioning | Z-Order | Liquid Clustering |
|---|---|---|---|
| Layout boundaries | Hard directory boundaries | Within-file sort, no dir boundaries | Soft cluster groups, no dir boundaries |
| Cardinality limit | Low (thousands) | High | High |
| Skew handling | Poor | Better | Good |
| Incremental rewrite | N/A | No (full rewrite per OPTIMIZE) | Yes |
| Change layout columns | Full table rewrite | Full table rewrite | ALTER + incremental migrate |
| Concurrency | File-level conflicts | File-level conflicts | Row-level concurrency |
| Coexists with the others | — | Yes (with partitioning) | No (mutually exclusive) |
Liquid Clustering is the new default, but partitioning has not been deleted from the toolbox. Two scenarios where partitioning is still the right choice:
- Partition-level lifecycle operations: on a date-partitioned table, deleting a whole partition (DELETE FROM ... WHERE date = '2023-01-01') aligns with file boundaries and is metadata-only, and differential retention policies per partition work the same way.
- Very large append-only tables where essentially every query filters on one low-cardinality date column; directory pruning gives the planner free file skipping.

Outside those cases, default to Liquid Clustering on new tables and migrate existing high-cardinality or skewed partitioned tables to it.
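As a sketch of the first scenario, assuming the hypothetical main.sales.orders_partitioned table from earlier: a DELETE whose predicate exactly matches the partition column removes whole files from the transaction log without rewriting data.

# Partition-aligned delete on a date-partitioned Delta table:
# files are dropped from the log; no data files are rewritten.
spark.sql("DELETE FROM main.sales.orders_partitioned WHERE year = 2023")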
Liquid Clustering is a Delta feature that physically organizes table data by chosen clustering keys without creating directory-level partitions. Unlike Hive-style partitioning, it adapts to high-cardinality keys without producing tiny files, supports changing the clustering keys after the table is in production, and avoids the small-files problem that kills over-partitioned tables. It is the right default for most new tables on modern Databricks.
Z-Order is a clustering operation applied periodically by OPTIMIZE ... ZORDER BY, typically on a date-partitioned table; clustering quality decays between runs, and the operation rewrites entire partitions. Liquid Clustering is incremental (new writes are placed near existing clusters automatically) and there is no partition column, so high-cardinality keys are safe. Liquid wins on continuously loaded tables, mutable tables, and high-cardinality keys; Z-Order is still acceptable on stable date-partitioned append-only tables built before Liquid existed.
Two cases. First, when you genuinely need partition-level operations: drop a year of data with a partition-aligned DELETE (Delta does not support DROP PARTITION), run a job that processes one date at a time, or apply differential retention policies per partition. Second, very large append-only tables with a single low-cardinality date filter on every query: directory partitioning gives the planner free file skipping with zero metadata cost. Outside these, Liquid is the better default; partitioning becomes a footgun the moment cardinality exceeds a few hundred values per partition column.
Up to four keys, ordered by selectivity in your query workload — the columns that appear most often in WHERE and JOIN predicates. Picking too many dilutes the clustering benefit because the data has to be organized in too many dimensions; picking the wrong ones means OPTIMIZE rewrites the table without speeding up queries. Use the query history in Databricks SQL to find the actual filter columns rather than guessing — it is almost always 1-2 columns plus a date.
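One way to do that lookup, assuming access to Databricks' system.query.history system table (the ILIKE filter is a crude text match, not a predicate parser):

# Pull recent statements that touched the table, then eyeball the
# WHERE clauses to see which columns are actually filtered on.
spark.sql("""
SELECT statement_text
FROM system.query.history
WHERE statement_text ILIKE '%main.sales.orders%'
  AND start_time >= current_date() - INTERVAL 7 DAYS
LIMIT 100
""").show(truncate=False)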
The ALTER TABLE ... CLUSTER BY (...) statement itself is metadata-only and instantaneous — no data is rewritten. New writes start clustering on the new keys immediately. Existing files keep their old organization until the next OPTIMIZE, which incrementally re-clusters them. That is the headline win over partitioning: re-partitioning a multi-terabyte table requires a full rewrite and a maintenance window, while changing Liquid keys is a single statement that re-converges over time.
Look at the Spark UI's scan node for the table, specifically the "files read" and "bytes read" versus "files in table" metrics. If a filter on a clustering key reads only a few percent of files, clustering is working; if it reads most files, either the key is wrong or OPTIMIZE has not run recently. Also check the explain plan for "PartitionFilters" / "DataFilters" being pushed down. Run DESCRIBE DETAIL to see the current clustering columns, and DESCRIBE HISTORY to find the most recent OPTIMIZE; if the table has had heavy writes since then, schedule one.
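A minimal verification sketch for the metadata side, assuming the orders table from earlier; DESCRIBE DETAIL exposes a clusteringColumns field on clustered tables, and DESCRIBE HISTORY records each OPTIMIZE:

# Current clustering columns and file counts
spark.sql("DESCRIBE DETAIL main.sales.orders") \
    .select("clusteringColumns", "numFiles", "sizeInBytes") \
    .show(truncate=False)

# Most recent OPTIMIZE run and its metrics
spark.sql("DESCRIBE HISTORY main.sales.orders") \
    .filter("operation = 'OPTIMIZE'") \
    .select("timestamp", "operationMetrics") \
    .show(1, truncate=False)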