Lance
Lance is an open-source columnar data format optimized for machine-learning and AI workloads. Built by LanceDB, Lance is positioned as a successor to Parquet for ML use cases — cheap random access, native vector embeddings, automatic versioning, and zero-copy reads into PyTorch / NumPy / Arrow without a deserialization step. LanceDB, the embedded vector database, is built directly on the Lance format.
Key Features:
- Random-Access Columnar. Unlike Parquet, which is optimized for full-column scans, Lance is built for fast point lookups by row ID — the dominant pattern in ML training (random shuffled batches) and vector retrieval.
- Native Vector Type. First-class fixed-size list (vector) columns with built-in ANN indexing — IVF, IVF-PQ, HNSW — stored alongside the data.
- Zero-Copy to Arrow. Tables map directly into Arrow buffers; no row-group parsing, no per-batch allocation. Drops straight into a PyTorch DataLoader or a JAX pipeline.
- Versioning & Time Travel. Every table mutation produces a new version; rollback and reproducible-training queries are first-class.
- Schema Evolution. Add or remove columns without rewriting data files.
- Cloud Storage Native. S3, GCS, Azure Blob; range reads optimized for object storage.
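The random-access feature rests on a simple property: a fixed-size-list (vector) column has a fixed byte width per row, so a point lookup is offset arithmetic plus one range read rather than a scan. The sketch below illustrates that idea with a flat buffer of float32 vectors; it is a minimal illustration of the layout principle, not Lance's actual file format, and the dimensionality and buffer are made up for the example.

```python
import struct

DIM = 4        # embedding dimensionality (illustrative, not a Lance constant)
ITEM_BYTES = 4 # float32

# A flat fixed-width buffer of 10 vectors; row i holds [i, i, i, i].
buf = b"".join(struct.pack(f"{DIM}f", *([float(i)] * DIM)) for i in range(10))

def read_row(buffer: bytes, row: int, dim: int = DIM) -> list:
    """Point lookup by row ID: compute the byte offset directly,
    read exactly one row's worth of bytes — no scan, no decode of
    neighboring rows."""
    off = row * dim * ITEM_BYTES
    return list(struct.unpack(f"{dim}f", buffer[off:off + dim * ITEM_BYTES]))

print(read_row(buf, 7))  # [7.0, 7.0, 7.0, 7.0]
```

Against object storage, the same arithmetic turns each lookup into a single HTTP range request, which is why the layout suits shuffled training batches.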
Lance vs. Parquet:
- Lance — Random access by row, vectors as a native data type, shaped for ML pipelines. Newer, with a smaller ecosystem than Parquet today.
- Parquet — Column scans and predicate pushdown for analytics; the lingua franca of the lakehouse. Sequential scans are fast, but random row reads are not.
- The two coexist: Parquet for analytical / BI tables, Lance for ML feature stores and embedding indexes.
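The row-group trade-off can be made concrete with back-of-the-envelope arithmetic. The numbers below are assumptions for illustration, not measurements, and the model is a worst-case upper bound: it assumes every shuffled lookup lands in a different row group and the reader must fetch that group's whole column chunk to decode one row (real Parquet readers fetch pages, so the true factor is smaller).

```python
# Illustrative read-amplification arithmetic for shuffled point lookups.
# All numbers are assumptions, not benchmarks.
row_bytes = 4_096          # one record, e.g. a 1024-dim float32 embedding
row_group_rows = 100_000   # a plausible Parquet row-group size
rows_wanted = 1_000        # one shuffled training batch, spread across groups

# Worst case: each wanted row sits in a different row group, and the
# column chunk for that group is read to decode a single row.
parquet_bytes = rows_wanted * row_group_rows * row_bytes

# Direct range read per row, as a random-access layout allows.
lance_bytes = rows_wanted * row_bytes

print(parquet_bytes // lance_bytes)  # read-amplification factor: 100000
```

Under these assumptions the scan-oriented layout reads five orders of magnitude more data for the same batch, which is the gap the "coexist" guidance reflects.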
Use Cases:
- Embedding feature stores feeding RAG retrieval and recommendation systems.
- ML training datasets where shuffle reads dominate — Parquet’s row-group model is a poor fit.
- Reproducible model training — pin a dataset version for the duration of a training run.
- Vector search via LanceDB — lightweight, embedded, no separate vector-DB server.
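For the vector-search use case, it helps to see the baseline that ANN indexes such as IVF and HNSW approximate: an exact top-k scan over every stored vector. The sketch below is a brute-force stdlib illustration of that baseline, not LanceDB code; the vectors and query are invented for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def knn(query, vectors, k=2):
    """Exact top-k nearest vectors by cosine similarity.
    ANN indexes (IVF, IVF-PQ, HNSW) trade a little recall to avoid
    this full scan over every row."""
    scored = sorted(enumerate(vectors),
                    key=lambda iv: cosine(query, iv[1]),
                    reverse=True)
    return [i for i, _ in scored[:k]]

vectors = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (-1.0, 0.0)]
print(knn((1.0, 0.0), vectors))  # [0, 2] — the two vectors nearest the query
```

An embedded engine can serve this directly from the columnar file, which is why no separate vector-DB server is needed for the LanceDB use case above.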