Lance
Lance is an open-source columnar data format optimized for machine-learning and AI workloads. Built by LanceDB, Lance is positioned as a successor to Parquet for ML use cases — cheap random access, native vector embeddings, automatic versioning, and zero-copy reads into PyTorch / NumPy / Arrow without a deserialization step. LanceDB, the embedded vector database, is built directly on the Lance format.
Key Features:
- Random-Access Columnar. Unlike Parquet, which is optimized for full-column scans, Lance is built for fast point lookups by row ID — the dominant pattern in ML training (random shuffled batches) and vector retrieval.
- Native Vector Type. First-class fixed-size list (vector) columns with built-in ANN indexing — IVF, IVF-PQ, HNSW — stored alongside the data.
- Zero-Copy to Arrow. Tables map directly into Arrow buffers; no row-group parsing, no per-batch allocation. Drops straight into a PyTorch DataLoader or a JAX pipeline.
- Versioning & Time Travel. Every table mutation produces a new version; rollback and reproducible-training queries are first-class.
- Schema Evolution. Add or remove columns without rewriting data files.
- Cloud Storage Native. S3, GCS, Azure Blob; range reads optimized for object storage.
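The random-access feature rests on a simple property: a fixed-size-list (vector) column has a fixed byte width per row, so a point lookup is offset arithmetic plus one range read rather than a scan. The sketch below illustrates that idea with a flat buffer of float32 vectors; it is a minimal illustration of the layout principle, not Lance's actual file format, and the dimensionality and buffer are made up for the example.

```python
import struct

DIM = 4        # embedding dimensionality (illustrative, not a Lance constant)
ITEM_BYTES = 4 # float32

# A flat fixed-width buffer of 10 vectors; row i holds [i, i, i, i].
buf = b"".join(struct.pack(f"{DIM}f", *([float(i)] * DIM)) for i in range(10))

def read_row(buffer: bytes, row: int, dim: int = DIM) -> list:
    """Point lookup by row ID: compute the byte offset directly,
    read exactly one row's worth of bytes — no scan, no decode of
    neighboring rows."""
    off = row * dim * ITEM_BYTES
    return list(struct.unpack(f"{dim}f", buffer[off:off + dim * ITEM_BYTES]))

print(read_row(buf, 7))  # [7.0, 7.0, 7.0, 7.0]
```

Against object storage, the same arithmetic turns each lookup into a single HTTP range request, which is why the layout suits shuffled training batches.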
Lance vs. Parquet:
- Lance — Random access by row, vectors as a native data type, shaped for ML pipelines. Newer, with a smaller ecosystem than Parquet today.
- Parquet — Column scans and predicate pushdown for analytics; the lingua franca of the lakehouse. Sequential scans are fast, but random row reads are not.
- The two coexist: Parquet for analytical / BI tables, Lance for ML feature stores and embedding indexes.
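The row-group trade-off can be made concrete with back-of-the-envelope arithmetic. The numbers below are assumptions for illustration, not measurements, and the model is a worst-case upper bound: it assumes every shuffled lookup lands in a different row group and the reader must fetch that group's whole column chunk to decode one row (real Parquet readers fetch pages, so the true factor is smaller).

```python
# Illustrative read-amplification arithmetic for shuffled point lookups.
# All numbers are assumptions, not benchmarks.
row_bytes = 4_096          # one record, e.g. a 1024-dim float32 embedding
row_group_rows = 100_000   # a plausible Parquet row-group size
rows_wanted = 1_000        # one shuffled training batch, spread across groups

# Worst case: each wanted row sits in a different row group, and the
# column chunk for that group is read to decode a single row.
parquet_bytes = rows_wanted * row_group_rows * row_bytes

# Direct range read per row, as a random-access layout allows.
lance_bytes = rows_wanted * row_bytes

print(parquet_bytes // lance_bytes)  # read-amplification factor: 100000
```

Under these assumptions the scan-oriented layout reads five orders of magnitude more data for the same batch, which is the gap the "coexist" guidance reflects.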
Use Cases:
- Embedding feature stores feeding RAG retrieval and recommendation systems.
- ML training datasets where shuffle reads dominate — Parquet’s row-group model is a poor fit.
- Reproducible model training — pin a dataset version for the duration of a training run.
- Vector search via LanceDB — lightweight, embedded, no separate vector-DB server.
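For the vector-search use case, it helps to see the baseline that ANN indexes such as IVF and HNSW approximate: an exact top-k scan over every stored vector. The sketch below is a brute-force stdlib illustration of that baseline, not LanceDB code; the vectors and query are invented for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def knn(query, vectors, k=2):
    """Exact top-k nearest vectors by cosine similarity.
    ANN indexes (IVF, IVF-PQ, HNSW) trade a little recall to avoid
    this full scan over every row."""
    scored = sorted(enumerate(vectors),
                    key=lambda iv: cosine(query, iv[1]),
                    reverse=True)
    return [i for i, _ in scored[:k]]

vectors = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (-1.0, 0.0)]
print(knn((1.0, 0.0), vectors))  # [0, 2] — the two vectors nearest the query
```

An embedded engine can serve this directly from the columnar file, which is why no separate vector-DB server is needed for the LanceDB use case above.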