Project Nessie
Project Nessie is an open-source “Git-for-data” catalog for Apache Iceberg, started by Dremio. It treats data tables the way Git treats source code, with commits, branches, tags, and merges, applied at the catalog layer that tracks each Iceberg table's current metadata. Nessie makes experimentation, isolation, and rollback first-class operations instead of one-off engineering exercises.
Key Features:
- Branches. Create a named branch (etl-2026-04-25) that points at the current state of every table; load data, validate it, then merge into main or discard the branch.
- Tags. Pin a named, immutable reference to a specific commit. Ideal for monthly close, regulatory snapshots, and ML training-set versioning.
- Atomic Multi-Table Commits. A single commit can change several tables together; the changes become visible all at once or not at all. This is the lakehouse equivalent of a multi-table database transaction.
- Time Travel by Reference. Query a branch, tag, or commit hash directly, for example SELECT * FROM table AT BRANCH dev in Dremio. The exact syntax varies by query engine.
- Merge & Diff. Compare branches, identify added / changed / removed tables, resolve conflicts.
- REST & gRPC APIs. Pluggable into Spark, Flink, Trino, Dremio, and any Iceberg-compatible engine.
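The features above can be illustrated with a small in-memory model. This is a sketch of the concepts only, not Nessie's API or storage format: ToyCatalog, Commit, and every method name are hypothetical, and table contents are stood in for by opaque pointer strings such as "orders-v1". The toy also ignores table deletion and uses a simplified merge that raises on any table changed differently on both sides.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Commit:
    id: str
    parent: Optional[str]
    tables: Dict[str, str]  # table name -> pointer to that table's metadata
    message: str


class ToyCatalog:
    """Hypothetical in-memory model of commits, branches, and tags."""

    def __init__(self) -> None:
        root = Commit("root", None, {}, "initial empty state")
        self.commits: Dict[str, Commit] = {root.id: root}
        self.branches: Dict[str, str] = {"main": root.id}  # movable pointers
        self.tags: Dict[str, str] = {}  # fixed pointers

    def resolve(self, ref: str) -> Commit:
        """Look up a branch name, tag name, or raw commit id."""
        commit_id = self.branches.get(ref) or self.tags.get(ref) or ref
        return self.commits[commit_id]

    def create_branch(self, name: str, from_ref: str = "main") -> None:
        # A branch is only a new pointer; no table data is copied.
        self.branches[name] = self.resolve(from_ref).id

    def create_tag(self, name: str, from_ref: str) -> None:
        if name in self.tags:
            raise ValueError(f"tag {name!r} already exists")
        self.tags[name] = self.resolve(from_ref).id

    def commit(self, branch: str, changes: Dict[str, str], message: str) -> str:
        """Record all table changes in one commit, then advance the branch."""
        head = self.commits[self.branches[branch]]
        tables = {**head.tables, **changes}
        payload = json.dumps([head.id, sorted(tables.items()), message])
        commit_id = hashlib.sha256(payload.encode()).hexdigest()[:12]
        self.commits[commit_id] = Commit(commit_id, head.id, tables, message)
        self.branches[branch] = commit_id  # one pointer move exposes every change
        return commit_id

    def diff(self, from_ref: str, to_ref: str) -> Dict[str, List[str]]:
        old, new = self.resolve(from_ref).tables, self.resolve(to_ref).tables
        return {
            "added": sorted(new.keys() - old.keys()),
            "removed": sorted(old.keys() - new.keys()),
            "changed": sorted(t for t in old.keys() & new.keys() if old[t] != new[t]),
        }

    def _ancestors(self, commit_id: Optional[str]) -> List[str]:
        chain = []
        while commit_id is not None:
            chain.append(commit_id)
            commit_id = self.commits[commit_id].parent
        return chain

    def merge(self, source: str, target: str) -> str:
        """Apply the source branch's changes since the common ancestor to target."""
        target_history = set(self._ancestors(self.branches[target]))
        base = next(c for c in self._ancestors(self.branches[source]) if c in target_history)
        theirs, ours = self.diff(base, source), self.diff(base, target)
        src, tgt = self.resolve(source).tables, self.resolve(target).tables
        incoming = theirs["added"] + theirs["changed"]
        touched_on_target = set(ours["added"] + ours["changed"])
        conflicts = [t for t in incoming if t in touched_on_target and src[t] != tgt[t]]
        if conflicts:
            raise RuntimeError(f"conflicting tables: {conflicts}")
        return self.commit(target, {t: src[t] for t in incoming}, f"merge {source} into {target}")


catalog = ToyCatalog()
catalog.commit("main", {"orders": "orders-v1"}, "initial load")
catalog.create_tag("month-end", "main")
catalog.create_branch("etl-2026-04-25")
catalog.commit("etl-2026-04-25",
               {"orders": "orders-v2", "customers": "customers-v1"},
               "daily load touching two tables")
print(catalog.resolve("main").tables)        # main is unaffected by the branch commit
print(catalog.diff("main", "etl-2026-04-25"))
catalog.merge("etl-2026-04-25", "main")
print(catalog.resolve("main").tables)        # both tables appear together
print(catalog.resolve("month-end").tables)   # the tag still reads the old state
```

Reading through a tag or commit id goes through the same resolve step as reading a branch, which is why time travel by reference needs no separate mechanism in this model.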
Why It Matters:
In a traditional lakehouse pipeline, data ingestion and data publication are the same step: a partial load is visible to consumers immediately. Nessie inverts this. Ingestion happens on a feature branch, validation and QA run against that branch, and only a successful merge exposes the change to the production view. The data team gets the same review-and-promote workflow software engineers have had since the late 2000s.
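The ingest, validate, and merge flow described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not Nessie client code: publish_via_branch, the loaders, and the check function are hypothetical names, references are held in a plain dict, and tables are lists of row dicts. A real catalog branch copies a pointer rather than the rows themselves.

```python
from typing import Callable, Dict, List

Rows = List[dict]
Tables = Dict[str, Rows]            # table name -> rows (stand-in for real table data)
Check = Callable[[Tables], bool]


def publish_via_branch(refs: Dict[str, Tables], branch: str,
                       load: Callable[[Tables], None],
                       checks: List[Check]) -> List[str]:
    """Load on a branch, validate there, and update main only if every check passes.

    Returns the names of the failed checks; an empty list means the load was published.
    """
    # Branch off main. The toy copies row lists; a catalog would copy a pointer.
    refs[branch] = {name: list(rows) for name, rows in refs["main"].items()}
    load(refs[branch])                        # ingestion touches only the branch
    failed = [check.__name__ for check in checks if not check(refs[branch])]
    if failed:
        del refs[branch]                      # discard the branch; main never changed
        return failed
    refs["main"] = refs.pop(branch)           # publish with a single reference swap
    return []


def load_orders(tables: Tables) -> None:
    tables.setdefault("orders", []).append({"id": 2, "amount": 40.0})


def load_bad_orders(tables: Tables) -> None:
    tables.setdefault("orders", []).append({"id": 3, "amount": None})


def no_null_amounts(tables: Tables) -> bool:
    return all(row["amount"] is not None for row in tables.get("orders", []))


refs = {"main": {"orders": [{"id": 1, "amount": 25.0}]}}
print(publish_via_branch(refs, "etl-good", load_orders, [no_null_amounts]))     # []
print(publish_via_branch(refs, "etl-bad", load_bad_orders, [no_null_amounts]))  # ['no_null_amounts']
print(len(refs["main"]["orders"]))                                              # 2
```

The failed load never reaches consumers because they only ever read through the main reference, which changes in one step after validation succeeds.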
Use Cases:
- Pre-production validation of large ETL changes before they hit BI dashboards.
- ML training-set versioning: tag the exact commit a model was trained on.
- Multi-tenant sandboxes: each analyst gets a branch off main for ad-hoc exploration.
- Disaster recovery: instantly point production back to the last known-good commit.
- Compliance: immutable, auditable history of every table change with author and timestamp.
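The disaster-recovery and compliance cases both rest on one idea: history is append-only, and "production" is only a pointer into it. The sketch below models that with a single linear branch. BranchHistory, LogEntry, and their methods are hypothetical names for illustration, not part of Nessie, and a real catalog would also record the rollback itself.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class LogEntry:
    commit_id: int
    author: str
    timestamp: datetime
    message: str
    tables: Dict[str, str]  # table name -> pointer to that table's metadata


class BranchHistory:
    """Hypothetical single-branch model: an append-only log plus a head pointer."""

    def __init__(self) -> None:
        self._log: List[LogEntry] = []     # entries are never edited or removed
        self.head: Optional[int] = None

    def commit(self, author: str, message: str, tables: Dict[str, str],
               when: Optional[datetime] = None) -> int:
        stamp = when or datetime.now(timezone.utc)
        entry = LogEntry(len(self._log), author, stamp, message, dict(tables))
        self._log.append(entry)
        self.head = entry.commit_id
        return entry.commit_id

    def rollback(self, commit_id: int) -> None:
        """Point the branch at an earlier commit; later entries stay in the log."""
        if not 0 <= commit_id < len(self._log):
            raise KeyError(f"unknown commit {commit_id}")
        self.head = commit_id

    def current_tables(self) -> Dict[str, str]:
        return {} if self.head is None else dict(self._log[self.head].tables)

    def audit_trail(self) -> List[Tuple[int, str, str, str]]:
        return [(e.commit_id, e.author, e.timestamp.isoformat(), e.message)
                for e in self._log]


prod = BranchHistory()
good = prod.commit("alice", "nightly load", {"orders": "orders-v1"})
prod.commit("bob", "faulty backfill", {"orders": "orders-v2-bad"})
prod.rollback(good)                 # no data is rewritten; only the pointer moves
print(prod.current_tables())        # {'orders': 'orders-v1'}
for row in prod.audit_trail():      # the faulty commit remains visible to auditors
    print(row)
```

Because rollback moves a pointer instead of rewriting data, it takes the same time regardless of table size, and the faulty commit remains available for later investigation.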