Project Nessie

Project Nessie is an open-source “Git-for-data” catalog, started by Dremio and used mainly with Apache Iceberg tables. It treats data tables the way Git treats source code, with commits, branches, tags, and merges. Nessie is itself the catalog: it versions the pointers to each table's metadata rather than copying any data. This makes experimentation, isolation, and rollback first-class operations instead of one-off engineering exercises.

Key Features:

Branches. Create a named branch (etl-2026-04-25) that points at the current state of every table; load data, validate it, then merge into main or discard.
Tags. Pin a named, immutable reference to a specific commit, which is useful for monthly close, regulatory snapshots, and ML training-set versioning.
Atomic Multi-Table Commits. A single commit can change several tables together; either all of the changes become visible or none do. This is the lakehouse equivalent of a multi-table database transaction.
Time Travel by Reference. Query a branch, tag, or commit hash directly. The exact syntax depends on the engine; in Dremio, for example: SELECT * FROM my_table AT BRANCH dev.
Merge & Diff. Compare two references to see which tables were added, changed, or removed, and surface conflicting changes when merging.
REST API. Nessie is exposed over HTTP/REST and integrates with Spark, Flink, Trino, Dremio, and other engines that support Iceberg catalogs.
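
The features above can be illustrated with a small in-memory model. This is only a sketch of the idea, not Nessie's API: ToyCatalog, Commit, and the metadata file names are hypothetical, and it assumes a commit is an immutable map from table name to a metadata pointer, with branches and tags as named pointers to commits.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Commit:
    """Immutable snapshot of the catalog: (table name, metadata pointer) pairs."""
    tables: Tuple[Tuple[str, str], ...]
    parent: Optional[str]
    message: str

    @property
    def hash(self) -> str:
        payload = json.dumps([self.tables, self.parent, self.message])
        return hashlib.sha256(payload.encode()).hexdigest()[:12]


class ToyCatalog:
    """Hypothetical toy model; a real catalog also handles concurrency and merges."""

    def __init__(self) -> None:
        root = Commit(tables=(), parent=None, message="root")
        self.commits: Dict[str, Commit] = {root.hash: root}
        self.branches: Dict[str, str] = {"main": root.hash}
        self.tags: Dict[str, str] = {}

    def resolve(self, ref: str) -> str:
        """Accept a branch name, a tag name, or a commit hash."""
        if ref in self.branches:
            return self.branches[ref]
        if ref in self.tags:
            return self.tags[ref]
        if ref in self.commits:
            return ref
        raise KeyError(f"unknown reference: {ref}")

    def tables_at(self, ref: str) -> Dict[str, str]:
        return dict(self.commits[self.resolve(ref)].tables)

    def create_branch(self, name: str, from_ref: str = "main") -> None:
        if name in self.branches:
            raise ValueError(f"branch already exists: {name}")
        # Branching copies one pointer, never any table data.
        self.branches[name] = self.resolve(from_ref)

    def create_tag(self, name: str, ref: str) -> None:
        if name in self.tags:
            raise ValueError(f"tag already exists and cannot be moved: {name}")
        self.tags[name] = self.resolve(ref)

    def commit(self, branch: str, changes: Dict[str, Optional[str]], message: str) -> str:
        """Apply all changes in one commit; a pointer of None drops the table."""
        tables = self.tables_at(branch)
        for name, pointer in changes.items():
            if pointer is None:
                tables.pop(name, None)
            else:
                tables[name] = pointer
        new = Commit(tuple(sorted(tables.items())), self.branches[branch], message)
        self.commits[new.hash] = new
        # One pointer update publishes every table change together.
        self.branches[branch] = new.hash
        return new.hash

    def diff(self, from_ref: str, to_ref: str) -> Dict[str, list]:
        a, b = self.tables_at(from_ref), self.tables_at(to_ref)
        return {
            "added": sorted(b.keys() - a.keys()),
            "removed": sorted(a.keys() - b.keys()),
            "changed": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
        }


catalog = ToyCatalog()
catalog.commit("main", {"orders": "orders-v1.json", "customers": "customers-v1.json"}, "initial load")
catalog.create_branch("etl-2026-04-25")
catalog.commit("etl-2026-04-25", {"orders": "orders-v2.json", "refunds": "refunds-v1.json"}, "nightly load")
changes = catalog.diff("main", "etl-2026-04-25")
print(changes)
```

In this model the branch commit touches two tables at once, and readers of main see neither change until the main pointer itself is moved.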

Why It Matters:

Without a versioned catalog, lakehouse pipelines tend to couple data ingestion with data publication, so a partial load can be visible to consumers as soon as it is written. Nessie decouples the two: ingestion happens on a working branch, validation and QA run against that branch, and only a successful merge exposes the change to the production view. The data team gets the same review-and-promote workflow software engineers have had since the late 2000s.
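
The ingest, validate, then publish flow described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Nessie client code: load_validate_publish, the branches dictionary, and the sample loaders are all hypothetical, and the deep copy stands in for what a real catalog does by copying a single pointer.

```python
import copy
from typing import Callable, Dict, List

Tables = Dict[str, List[dict]]


def load_validate_publish(
    branches: Dict[str, Tables],
    load: Callable[[Tables], None],
    validate: Callable[[Tables], bool],
    work_branch: str = "etl",
) -> bool:
    """Ingest on a work branch and expose the result on main only if validation passes."""
    # Branch off main (toy stand-in: copy the data instead of a pointer).
    branches[work_branch] = copy.deepcopy(branches["main"])
    load(branches[work_branch])  # ingestion touches only the work branch
    ok = validate(branches[work_branch])
    if ok:
        # "Merge": main now points at the validated state.
        branches["main"] = branches[work_branch]
    # Drop the work branch either way; a failed load is simply discarded.
    del branches[work_branch]
    return ok


def bad_load(tables: Tables) -> None:
    tables["orders"].append({"id": 2, "amount": None})  # incomplete row


def good_load(tables: Tables) -> None:
    tables["orders"].append({"id": 2, "amount": 25.0})


def no_null_amounts(tables: Tables) -> bool:
    return all(row["amount"] is not None for row in tables["orders"])


branches = {"main": {"orders": [{"id": 1, "amount": 10.0}]}}

rejected = load_validate_publish(branches, bad_load, no_null_amounts)
rows_after_bad = len(branches["main"]["orders"])

accepted = load_validate_publish(branches, good_load, no_null_amounts)
rows_after_good = len(branches["main"]["orders"])
```

The point of the design is that consumers reading main never observe the bad load: it exists only on the work branch, which is discarded when validation fails.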

Use Cases:

Pre-production validation of large ETL changes before they hit BI dashboards.
ML training-set versioning: tag the exact commit a model was trained on.
Multi-tenant sandboxes: each analyst gets a branch off main for ad-hoc exploration.
Disaster recovery: point production back to the last known-good commit with a single reference update.
Compliance: an immutable, auditable history of every table change, with author and timestamp.
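
The tagging, rollback, and audit use cases share one mechanism: an append-only commit log plus movable and immovable references. The following is a minimal sketch of that mechanism, assuming a single production branch; ToyHistory, LogEntry, the author name, and the tag name are hypothetical and do not correspond to Nessie's API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional, Union


@dataclass(frozen=True)
class LogEntry:
    """One audit record; frozen so it cannot be edited after the fact."""
    commit_id: int
    author: str
    timestamp: datetime
    message: str


class ToyHistory:
    """Hypothetical single-branch model: an append-only log, a head pointer, and tags."""

    def __init__(self) -> None:
        self.log: List[LogEntry] = []
        self.head: Optional[int] = None  # commit the production branch points at
        self.tags: Dict[str, int] = {}

    def commit(self, author: str, message: str) -> int:
        entry = LogEntry(len(self.log), author, datetime.now(timezone.utc), message)
        self.log.append(entry)
        self.head = entry.commit_id
        return entry.commit_id

    def tag(self, name: str, commit_id: int) -> None:
        if name in self.tags:
            raise ValueError(f"tag already exists and cannot be moved: {name}")
        self.tags[name] = commit_id

    def rollback_to(self, ref: Union[str, int]) -> None:
        """Point production back at an earlier commit; the log itself is untouched."""
        commit_id = self.tags[ref] if isinstance(ref, str) else ref
        if not 0 <= commit_id < len(self.log):
            raise ValueError(f"unknown commit: {commit_id}")
        self.head = commit_id


history = ToyHistory()
good = history.commit("etl-bot", "load 2026-04-24")
history.tag("model-v7-training-set", good)
history.commit("etl-bot", "load 2026-04-25 (corrupt)")
history.rollback_to("model-v7-training-set")
```

After the rollback, production reads the tagged commit again, while the log still records the corrupt load with its author and timestamp for later audit.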