The Native S3 Vector Database
LanceDB lets you run a vector database directly on S3 by storing Lance tables in your own buckets.
S3 remains the storage layer and source of truth; LanceDB is a distributed vector database built with a lakehouse architecture. By keeping your primary data on S3 instead of a RAM-heavy serving tier, you can cut storage costs by up to 200x while serving millions of tables and tens of billions of rows in a single index.
Vectors, metadata, and pointers to raw objects share one schema and one lifecycle on S3. You get database-style retrieval over your lake without introducing another storage tier.
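In practice, getting started is a connect call against a bucket. Here is a minimal sketch with the Python client; the bucket name, schema, and values are illustrative, not prescribed:

```python
import lancedb

# Connect straight to a bucket prefix; no server or cluster to provision.
db = lancedb.connect("s3://my-bucket/lancedb")  # hypothetical bucket

# One table carries vectors, metadata, and pointers to the raw objects.
table = db.create_table(
    "docs",
    data=[
        {"vector": [0.1, 0.2, 0.3, 0.4], "text": "hello",
         "s3_uri": "s3://my-bucket/raw/a.pdf"},
        {"vector": [0.9, 0.8, 0.7, 0.6], "text": "world",
         "s3_uri": "s3://my-bucket/raw/b.pdf"},
    ],
)

# Nearest-neighbor search reads Lance files directly from S3.
results = table.search([0.1, 0.2, 0.3, 0.4]).limit(2).to_pandas()
```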
Stop Duplicating Your Data on S3
Most existing stacks follow the same pattern:
- Read data from S3
- Bulk-create and persist embeddings to disk
- Ingest embeddings and metadata into a separate vector store
- Maintain links between that metadata and the multimodal data sitting in the object store
- Keep the vector database in sync with the source data in S3 forever
You essentially pay for storage twice and have to manage additional ETL workloads to keep the index and embeddings up to date.
With LanceDB:
- Multimodal blobs, embeddings, and metadata are stored together in Lance tables on S3, not locked inside a proprietary data format
- Object references, features, and labels live in the same table, backed by your governance processes
- Pipelines write once to S3; query services read from there
That means less data movement, fewer jobs to maintain, and a simpler answer to “where do our embeddings live?” Storage cost is tied to S3, not to a second vendor tier.
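As a sketch of that write path, continuing the illustrative table from above (the embedding function is a stand-in, not a LanceDB API): the pipeline appends rows that already contain the embedding, the metadata, and the object reference, so there is no second system to reconcile.

```python
import lancedb

db = lancedb.connect("s3://my-bucket/lancedb")  # hypothetical bucket
table = db.open_table("docs")

def embed(text: str) -> list[float]:
    return [0.0, 0.0, 0.0, 0.0]  # stand-in for a real embedding model

# A single append writes the embedding, the metadata, and the pointer
# to the raw object together; nothing downstream needs re-syncing.
table.add([{
    "vector": embed("quarterly report"),
    "text": "quarterly report",
    "s3_uri": "s3://my-bucket/raw/q3.pdf",
}])
```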
S3 Object Store Vector Retrieval
Standard columnar formats were not designed for scattered, small reads over object storage. On S3, readers of those formats tend to fetch whole files into cache and burn compute just to return a few rows.
LanceDB is built on the open Lance table format, tuned for S3 vector workloads. Because the format is open, the same tables serve training, evaluation, and production retrieval: one copy of the data, not separate systems. On S3, Lance delivers:
- Columnar layout + indexing that minimizes S3 round trips for random access
- Orders-of-magnitude higher random-access throughput than Parquet-style layouts for small, scattered reads, so the same dataset can serve both ML pipelines and retrieval workloads
- Support for training, shuffling, evaluation, and online search directly from S3, without staging the full dataset to local SSD
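The random-access path is visible at the format level too. A sketch with the pylance library; the dataset path mirrors how LanceDB typically lays tables out under the database prefix, so treat it as an assumption:

```python
import lance

# Open the Lance dataset behind a table, directly on S3.
ds = lance.dataset("s3://my-bucket/lancedb/docs.lance")  # hypothetical path

# Fetch scattered rows by index -- e.g. a sampled training batch --
# without downloading or caching whole files.
batch = ds.take([3, 1_024, 987_654], columns=["vector", "text"])
```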
LanceDB's indexes are designed for datasets larger than RAM: they are entirely disk-based, scaling to multi-terabyte and petabyte-scale lakes stored entirely on S3. You can point training jobs and online services at the same S3-backed datasets and keep I/O overhead (and cost) under control.
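Building one of these disk-based indexes is a single call in the Python client. The tuning values below are illustrative, not recommendations:

```python
import lancedb

table = lancedb.connect("s3://my-bucket/lancedb").open_table("docs")

# The default index type is IVF_PQ; index files live on S3 next to the
# data, so the indexed dataset can be far larger than node RAM.
table.create_index(
    metric="cosine",
    num_partitions=256,  # IVF cells; illustrative, tune per dataset
    num_sub_vectors=16,  # PQ sub-vectors; must divide the vector dimension
)
```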
Built for AWS S3 Vector Search Workloads
For most teams on AWS, S3 is the primary data lake storage. LanceDB is designed to run on top of that data, using S3 as the storage layer for its tables instead of creating a separate vector-only store.
Typical patterns:
- Large-scale vision and multimodal: Image, audio, and video blobs live on S3 in the same Lance tables as the features you train on. Queries run over those features without relocating the blobs.
- RAG over large corpora: Documents, PDFs and logs are stored in S3. Chunks, embeddings, and access control metadata are stored as Lance tables in the same buckets. Retrieval behaves like “S3 vector search” over your lake instead of a copy in a separate system.
- Long-term archives: Historical datasets stay on S3 tiers. LanceDB query services touch only the vectors needed for a run instead of keeping everything hot in RAM-heavy clusters.
LanceDB runs in your VPC, close to S3, and exposes a simple client API over data that never leaves your account.
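For the RAG pattern above, for example, retrieval is a vector search with a SQL-style filter over the same table. In this sketch the column name and filter value are hypothetical:

```python
import lancedb

table = lancedb.connect("s3://my-bucket/lancedb").open_table("docs")
query_vector = [0.1, 0.2, 0.3, 0.4]  # stand-in for an embedded query

# ANN search plus a metadata filter, evaluated over one Lance table on
# S3: chunks, vectors, and access-control columns side by side.
hits = (
    table.search(query_vector)
    .where("acl_group = 'finance'")  # hypothetical ACL column
    .limit(5)
    .to_pandas()
)
```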
Compute–Storage Separation
Vector workloads are spiky. Training and evaluation are batch heavy; online search has distinct peaks. A fixed, stateful cluster is an expensive fit.
LanceDB keeps storage on S3 and treats compute as stateless:
- Query services scale up, down, or to zero based on traffic
- Tables, metadata, and vectors live as Lance files on S3; that is the only durable state
- New query nodes can attach to the same buckets with minimal warm-up; they don’t preload the full dataset
For infra and data teams, this means:
- You size compute for current QPS, not total dataset size
- Storage costs follow S3 pricing instead of a separate hot storage tier
- Recovery and environment cloning are simple: point LanceDB at the same S3 paths
You get database behavior with lake economics.
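As a sketch of what that looks like from a freshly started query node (bucket, region, and credential handling are assumptions; storage_options is the client's pass-through for S3 configuration):

```python
import lancedb

# A new, stateless query node attaches to the same bucket with no data
# to preload and no state to restore; the only durable state is on S3.
db = lancedb.connect(
    "s3://my-bucket/lancedb",                 # same hypothetical bucket
    storage_options={"region": "us-east-1"},  # illustrative S3 settings
)
table = db.open_table("docs")
print(table.search([0.1, 0.2, 0.3, 0.4]).limit(1).to_pandas())
```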
Trusted by Data-Heavy AI Teams
“Lance has been a significant enabler for our multimodal data workflows. Its performance and feature set offer a dramatic step up from legacy formats like WebDataset and Parquet. Using Lance has freed up considerable time and energy for our team, allowing us to iterate faster and focus more on research.” - Keunhong Park, Member of Technical Staff, World Labs
Own Your Data and Your Cost Curve
If you are designing a new stack or moving off a cluster-centric vector database, LanceDB on S3 gives you:
- An S3-native vector database model where the lake remains the source of truth
- Object-store-optimized vector retrieval for random access and training workloads
- AWS S3 vector search capabilities without copying data into a separate storage layer
Because LanceDB and the Lance file format are open source, the same S3-backed tables work in embedded, self-hosted, and managed deployments without rewriting or exporting data. Your objects and embeddings stay in S3, under your account and lifecycle policies. LanceDB is the engine that makes that data searchable at scale.