DuckLake
DuckLake is a 2025 lakehouse format from the DuckDB team. Its central design choice is unusual: instead of storing table metadata as JSON / Avro files alongside the data (the Iceberg / Delta / Hudi pattern), DuckLake puts the catalog metadata in a regular SQL database (Postgres, MySQL, SQLite, or DuckDB itself). Data files remain Parquet on object storage. The result is a single small binary plus a Postgres instance (no HMS, no REST catalog server, no JVM) with full ACID semantics on the lakehouse.
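As a sketch of what this looks like in practice (the paths, catalog name, and connection string below are illustrative, and the extension API may still be evolving):

```sql
-- Load the DuckLake extension inside DuckDB
INSTALL ducklake;
LOAD ducklake;

-- Attach a catalog: metadata in a local DuckDB file, data as Parquet
-- under data_files/. A Postgres-backed catalog would instead use a
-- connection string along the lines of 'ducklake:postgres:dbname=...'.
ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'data_files/');
USE lake;

-- From here it is ordinary SQL; each table becomes Parquet files on
-- storage plus rows in the catalog database
CREATE TABLE events (id INTEGER, payload VARCHAR);
INSERT INTO events VALUES (1, 'hello');
```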
Key Features:
- SQL-Database Catalog. Metadata lives in real tables, queryable with SQL. Inspecting the table layout means `SELECT * FROM ducklake_metadata`, not parsing JSON files.
- ACID Transactions. Multi-table commits run as a single SQL transaction in the metadata DB; the database supplies the transactional machinery that Iceberg has to rebuild on top of object storage.
- Time Travel & Schema Evolution. Standard lakehouse table-format features.
- Single Binary. DuckDB + DuckLake extension. No JVM, no Spark cluster, no HMS process. Run on a laptop or a Lambda.
- Open Spec. Format and protocol are open; other engines can implement readers.
- Object-Storage-Native. Parquet on S3 / GCS / R2; the metadata DB can live anywhere DuckDB can reach it (e.g. via its postgres, mysql, or sqlite extensions).
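The catalog-in-SQL and transaction points above can be sketched concretely. Table and column names below follow the published DuckLake spec, and the example assumes a catalog attached as `lake` with hypothetical `orders` and `audit_log` tables; treat the details as illustrative since the format is young:

```sql
-- Inspect the catalog by querying the metadata database directly
SELECT snapshot_id, snapshot_time FROM ducklake_snapshot;
SELECT table_id, table_name FROM ducklake_table;

-- A multi-table commit is just one SQL transaction in the metadata DB
BEGIN TRANSACTION;
INSERT INTO lake.orders VALUES (42, 'widget');
INSERT INTO lake.audit_log VALUES (42, now());
COMMIT;

-- Time travel: read a table as of an earlier snapshot
SELECT * FROM lake.orders AT (VERSION => 3);
```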
Why It’s Notable:
The big three open table formats (Iceberg, Hudi, Delta) reinvented database internals on top of object storage — transaction logs, optimistic concurrency, snapshot isolation, all hand-built. DuckLake takes the opposite philosophical approach: object storage holds the data, a real database holds the metadata, and you compose two well-understood systems instead of building a new one. It is arguably the simplest lakehouse design currently shipping.
Trade-offs:
- Pros. Drastically simpler operations, instant ACID via SQL transactions, easier introspection, no metadata-only catalog server to run.
- Cons. Catalog DB becomes a coordination bottleneck at very high write concurrency; ecosystem is brand new (2025); fewer engines support it today.
Use Cases:
- Small-to-medium lakehouses where the operational overhead of Iceberg + Polaris is overkill.
- Embedded / single-node analytics on top of Parquet on S3.
- Greenfield projects starting in 2025+ that want the simplest possible lakehouse stack.
- Data-engineering teams already standardized on DuckDB for local development.
DuckLake is brand new — treat it as a credible architectural option to track, not yet as a default production choice.