LanceDB is built on top of the Lance columnar data format, which provides the foundation for its multimodal capabilities. Lance combines the performance of Apache Arrow with advanced features designed specifically for AI workloads.
The Lance format enables LanceDB to serve as a unified data store that eliminates the need for separate databases. Unlike traditional vector databases that only store embeddings, LanceDB can store both the original data and its vector representations in the same efficient format.
| Advantage | Description |
|---|---|
| Multimodal Storage | Efficiently holds vectors, images, videos, audio, text, and more |
| Version Control | Built-in data versioning for reproducible ML experiments and data lineage |
| ML-Optimized | Designed for training and inference workloads with fast random access |
| Query Performance | Columnar storage enables blazing-fast vector search and analytics |
| Cloud-Native | Seamless integration with cloud object stores (S3, GCS, Azure Blob) |
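To make the "unified store" idea concrete, here is a minimal pure-Python sketch (not the LanceDB API, and with made-up toy data): each row holds both the original content and its embedding, so a vector search returns the full record with no second lookup in a separate database.

```python
import math

# A toy "table": each row stores the original data and its vector
# side by side as a single record (hypothetical data for illustration).
rows = [
    {"id": 1, "text": "a photo of a cat", "vector": [1.0, 0.0]},
    {"id": 2, "text": "a photo of a dog", "vector": [0.0, 1.0]},
    {"id": 3, "text": "a cat on a sofa", "vector": [0.9, 0.1]},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vector, k=2):
    # Brute-force nearest-neighbor search. The hits are the original
    # rows themselves -- text and vector together, no join required.
    return sorted(
        rows,
        key=lambda r: cosine(query_vector, r["vector"]),
        reverse=True,
    )[:k]

hits = search([1.0, 0.05])
print([r["text"] for r in hits])
```

In LanceDB the same principle applies at scale: the raw payload columns and the vector column live in the same Lance dataset, so there is no separate system of record to keep in sync.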
The following concepts are important to keep in mind:
First, each version contains metadata plus only the new or updated data from your transaction. So if you have 100 versions, they are not 100 duplicates of the same data. They do, however, carry 100x the metadata overhead of a single version, which can slow down queries.
Second, these versions exist to keep LanceDB scalable and consistent. We do not immediately blow away old versions when creating new ones because other clients might be in the middle of querying the old version. It’s important to retain older versions for as long as they might be queried.
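A simplified model of the two points above (illustrative only; Lance's real manifests track far more than this): each commit writes its new data once as a fragment and adds a small manifest recording which fragments are visible at that version, so older versions remain queryable without duplicating data.

```python
# Toy model of delta-based versioning: fragment data is written once,
# and each version's manifest merely lists the fragments it can see.
fragments = {}   # fragment id -> row data (never duplicated)
manifests = []   # one small manifest per committed version

def commit(new_data):
    frag_id = len(fragments)
    fragments[frag_id] = new_data
    visible = (manifests[-1]["visible"] if manifests else []) + [frag_id]
    manifests.append({"version": len(manifests) + 1, "visible": visible})

commit(["row-1", "row-2"])   # version 1
commit(["row-3"])            # version 2
commit(["row-4", "row-5"])   # version 3

def read(version):
    # A reader pinned to an old version still sees a consistent view,
    # even while newer versions are being committed.
    visible = manifests[version - 1]["visible"]
    return [row for fid in visible for row in fragments[fid]]

print(read(2))
print(read(3))
```

Note that three versions exist but only three fragments were ever written; the per-version cost is the manifest metadata, which is exactly why many versions add metadata overhead rather than data duplication.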
As you insert more data, your dataset will grow and you'll need to perform compaction to maintain query throughput (i.e., keep latencies low). Compaction is the process of merging fragments together to reduce the amount of metadata that needs to be managed, and to reduce the number of files that need to be opened while scanning the dataset.
Compaction runs in the background, merging small fragments into larger ones and removing rows that have been marked as deleted. Depending on the use case and dataset, the optimal compaction strategy will differ: how often to compact and how large the merged fragments should be are workload-dependent choices.
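The fragment-merging step can be sketched as follows (a toy model, not Lance's implementation, with a made-up `target_rows` knob): many small fragments are rewritten into fewer larger ones, so a scan has fewer files to open and less metadata to track.

```python
# Before compaction: many small fragments, one data file each.
fragments = [["row-1"], ["row-2"], ["row-3"], ["row-4"]]

def compact(frags, target_rows=4):
    # Merge fragments until each merged fragment holds roughly
    # target_rows rows (hypothetical parameter for illustration).
    merged, current = [], []
    for frag in frags:
        current.extend(frag)
        if len(current) >= target_rows:
            merged.append(current)
            current = []
    if current:
        merged.append(current)
    return merged

compacted = compact(fragments)
# Same rows, far fewer files for a scan to open.
print(len(fragments), "->", len(compacted))
```

The trade-off this models is real: larger fragments mean cheaper scans and less metadata, while smaller fragments mean cheaper incremental writes, which is why the right compaction cadence depends on your workload.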
Although Lance allows you to delete rows from a dataset, it does not actually remove the data immediately. It simply marks the row as deleted in a deletion file associated with the fragment's DataFile.
For a given version of the dataset, each fragment can have up to one deletion file (if no rows were ever deleted from that fragment, it will not have a deletion file). This is important to keep in mind because it means that the data is still there, and can be recovered if needed, as long as that version still exists based on your backup policy.
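This deletion mechanism can be modeled like this (a simplified sketch, not Lance's on-disk encoding): a fragment keeps at most one set of deleted row offsets, scans filter those offsets out, and the underlying rows stay physically present and recoverable for as long as the version is retained.

```python
# One fragment's data plus its (at most one) deletion file,
# modeled here as a set of deleted row offsets.
fragment = {
    "data": ["row-0", "row-1", "row-2", "row-3"],
    "deleted": set(),   # empty set: no deletion file yet
}

def delete_row(frag, offset):
    # Deletion only records the offset; the data file is untouched.
    frag["deleted"].add(offset)

def scan(frag):
    # Scans skip any row whose offset appears in the deletion file.
    return [r for i, r in enumerate(frag["data"]) if i not in frag["deleted"]]

delete_row(fragment, 1)
print(scan(fragment))        # row-1 is hidden from queries...
print(fragment["data"][1])   # ...but still physically present
```

Because the row is only masked, reverting to an earlier version (or reading the fragment directly) can recover it; the data is gone for good only once compaction rewrites the fragment and the old versions are cleaned up.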