Lance x DuckDB: SQL for Retrieval on the Multimodal Lakehouse Format
The Lance extension for DuckDB turns DuckDB into a SQL compute engine over Lance datasets, exposing vector, full-text, and hybrid retrieval as SQL table functions. This enables fully composable retrieval workflows: joins with eval data, reproducible top-k slicing, SQL-based debugging, and materialization back into Lance.
This extension bridges traditional SQL analytics with multimodal retrieval on a single open dataset format.
Rethinking Table File Paths with Uber: Lance’s Multi-Base Layout
Working with Uber’s AI Infrastructure team, Lance introduced a multi-base layout to support product systems that need a single dataset to span multiple S3 buckets for parallel reads and writes.
By separating storage bases from file references, Lance enables multi-bucket and multi-region layouts with compact, relocatable metadata, allowing Uber to scale training and retrieval workloads without fragmenting datasets or rewriting metadata.
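To make the idea of separating storage bases from file references concrete, here is a minimal sketch in Python. It is not Lance's actual manifest format: the `Manifest` and `FileRef` classes, the bucket names, and the file paths are all hypothetical, chosen only to show why relocating a base does not require rewriting per-file metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRef:
    """Hypothetical file reference: a small base id plus a path relative to that base."""
    base_id: int
    relative_path: str


@dataclass
class Manifest:
    """Hypothetical manifest: one table of bases shared by many compact file references."""
    bases: dict[int, str]  # base id -> storage root URI
    files: list[FileRef]

    def resolve(self, ref: FileRef) -> str:
        """Join the base root and the relative path into a full object URI."""
        return f"{self.bases[ref.base_id].rstrip('/')}/{ref.relative_path}"

    def relocate(self, base_id: int, new_root: str) -> None:
        """Move a base by editing a single entry; file references stay untouched."""
        self.bases[base_id] = new_root


# Illustrative bucket names and paths (assumptions, not from the article).
manifest = Manifest(
    bases={0: "s3://bucket-a/dataset", 1: "s3://bucket-b/dataset"},
    files=[FileRef(0, "data/part-0.lance"), FileRef(1, "data/part-1.lance")],
)

print([manifest.resolve(f) for f in manifest.files])

# Relocating base 1 changes one metadata entry, not every file reference.
manifest.relocate(1, "s3://bucket-c/dataset")
print(manifest.resolve(manifest.files[1]))
```

In this toy model, each file reference costs only an integer and a short relative path, and moving a base to another bucket or region is a single-entry update.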
The Quest for One Million IOPS: Benchmarking Storage at Lance
Recent storage benchmarks in Lance reached up to 1.5 million IOPS by combining a scheduler rework with io_uring, showing that high random-access throughput depends more on reducing CPU overhead and context switching than on single-read latency.
This blog explains how this design better drives modern NVMe hardware for vector, text, and key-based lookups, and contrasts embedded and disaggregated architectures to show how LanceDB scales from single-process deployments to large, distributed systems.
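One rough way to see why CPU overhead matters more than single-read latency is a back-of-the-envelope calculation using Little's law. The sketch below is illustrative only: the 1.5 million IOPS figure comes from the text above, while the 100-microsecond read latency and the 8-core count are assumptions picked for the arithmetic, not numbers from the benchmark.

```python
def required_in_flight(target_iops: float, read_latency_s: float) -> float:
    """Little's law: average requests in flight = throughput x latency."""
    return target_iops * read_latency_s


def cpu_budget_per_io_us(target_iops: float, cores: int) -> float:
    """CPU microseconds each I/O may consume if `cores` cores are fully busy."""
    return cores * 1_000_000 / target_iops


TARGET_IOPS = 1_500_000     # figure quoted in the text
ASSUMED_LATENCY_S = 100e-6  # assumption: 100 microsecond NVMe random read
ASSUMED_CORES = 8           # assumption: cores devoted to issuing I/O

in_flight = required_in_flight(TARGET_IOPS, ASSUMED_LATENCY_S)
budget = cpu_budget_per_io_us(TARGET_IOPS, ASSUMED_CORES)

print(f"requests in flight needed: {in_flight:.0f}")
print(f"CPU budget per I/O: {budget:.2f} us")
```

Under these assumptions, the target needs about 150 reads in flight, which deep device queues can absorb, but leaves only roughly 5.3 microseconds of CPU per read across all 8 cores. Per-request CPU cost, including system calls and context switches, therefore becomes the binding constraint well before device latency does.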
Also Published This Month
- One System, Many Workloads: Rethinking What “Multimodal” Means for AI
- Apache Polaris and Lance: Bringing AI-Native Storage to the Open Multimodal Lakehouse
- Keep Your Data Fresh with CocoIndex and LanceDB
Upcoming Events
February Open Data + AI Meetup - Peninsula, Bay Area Edition: Thursday, February 12
Hear from speakers from LanceDB, Fivetran, Dremio, and typedef about what they’re building and how they’re defining the future of open data and AI.
NYC Lakehouse Meetup: Tuesday, February 17
We’re bringing together the Apache Iceberg, Lance, and Apache DataFusion communities in NYC to chat about all things open lakehouse and data infrastructure at Cloudflare’s NYC office!
LanceDB Enterprise Updates
| Feature | Description |
|---|---|
| Add page cache prewarm API | Users can prewarm LanceDB tables using a LanceDB administrative API. (It is also possible to prewarm some columns, but not others.) This is useful for cases where we want to ensure that data is in the page cache prior to running a specific workload. It is also useful for benchmarking. |
| Admission Control for Feature Engineering Jobs | Avoid deadlocks by rejecting jobs if the cluster does not have enough resources to execute the job. |
| Adaptive Batch sizing for Feature Engineering Job checkpoints | Backfill jobs now change checkpoint size depending on UDF execution time. Internal benchmarks show up to 2x performance improvements. |
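
As an illustration of the adaptive batch sizing idea in the table above, the following sketch adjusts a checkpoint batch size from the measured UDF execution time. It is not LanceDB Enterprise's implementation: the function name, the 30-second target, the 2x step limit, and the size bounds are all hypothetical choices made for the example.

```python
def next_batch_size(
    current: int,
    batch_seconds: float,
    target_seconds: float = 30.0,  # assumed target time between checkpoints
    min_size: int = 1,
    max_size: int = 10_000,
) -> int:
    """Pick the next checkpoint batch size so a batch takes about target_seconds."""
    if batch_seconds <= 0:
        # No usable timing yet: grow cautiously.
        return min(current * 2, max_size)
    # Rows that would fit in the target time at the observed per-row cost.
    proposed = int(current * target_seconds / batch_seconds)
    # Limit each step to a 2x change in either direction to avoid oscillation.
    proposed = max(current // 2, min(proposed, current * 2))
    return max(min_size, min(proposed, max_size))


# A slow UDF (60 s for 100 rows) halves the batch; a fast one doubles it.
print(next_batch_size(100, 60.0))
print(next_batch_size(100, 5.0))
```

The design intent in this sketch is that slow UDFs checkpoint more often, so less work is lost on failure, while fast UDFs use larger batches so checkpoint overhead is amortized over more rows.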
Open Source Releases
| Project | Description |
|---|---|
| Lance v1.0.1 - v1.0.4 Release notes | • Multi-base storage layouts enabling a single dataset to span multiple buckets or regions for parallel reads and writes (#5790, #5801) • Faster query execution via tighter WAND block score bounds and reduced per-query overhead (#5668, #5696) |
| LanceDB v0.26 - v0.28 Release notes | • DuckDB-powered SQL retrieval with vector, FTS, and hybrid search exposed as composable table functions (#2946, #2957) • Expanded embedding support (VoyageAI v4, multimodal) and improved ingestion robustness via parallel embedding computation and better remote query cancellation (#2959, #2887, #2896, #2913) |
| lance-graph v0.4.0 - v0.5.0 Release notes | • Significantly expanded Cypher expressiveness with WITH clause chaining, COLLECT, and COUNT(DISTINCT …) support (#86, #85, #116) • Integrated vector search and similarity UDFs into graph queries, with improved execution efficiency on object stores (#80, #81, #83, #89, #96) |
| lance-context v0.2.0 - v0.2.1 Release notes | • Core context store APIs for append, search, and versioned checkout across Python and Rust (#6, #11, #12, #24) • Improved runtime behavior with multimodal context support, background compaction, and reduced Python-side blocking during remote I/O (#9, #28, #29) |
| lance-duckdb v0.4.1 - v0.5.0 Release notes | • Improved DuckDB integration with global aggregate pushdown and expanded vector search ergonomics, including ARRAY-based query vectors and tuning controls (#124, #119, #120) |
| lance-namespace v0.4.4 - v0.4.5 Release notes | • New Lance partitioning specification for defining and operating on partitioned datasets (#279, #297) |
| lance-ray v0.1.0 - v0.2.0 Release notes | • Distributed Ray-based IVF_SQ / PQ / FLAT index builder for scalable, parallel index creation (#67) |
| lance-spark v0.2.0 Release notes | • Spark MERGE INTO support for upserts and deletes, plus vector search and distributed index creation for large-scale Spark pipelines (#172, #189, #171) |
Community Contributions
Thank you to contributors from Uber, Netflix, Hugging Face, Bytedance, Huawei, Tencent, and Alibaba for improvements across embeddings, query robustness, storage compatibility, distributed indexing, Spark integration, and core format reliability in LanceDB, Lance, lance-spark, and lance-ray.
Notable contributions this month:
- @fzowl: Added support for VoyageAI v4 and multimodal models, expanding first-class embedding options in LanceDB.
- @dcfocus: Delivered major Cypher features in lance-graph, including `COLLECT` aggregation, `WITH` clause query chaining, and foundational context APIs.
- @ChunxuTang: Expanded Cypher query capabilities with `COUNT(DISTINCT …)`, case-insensitive matching, and vector search operators.
- @beinan: Improved execution efficiency and deployability across lance-graph and lance-context, enabling more scalable production deployments.
- @jja725: Implemented background compaction for Lance fragments, improving long-running system performance.
- @ex172000: Improved performance and correctness through executor fixes and parallelized embedding computation.
- @fatelei: Prevented Python-side blocking by releasing the GIL during remote storage operations.
- @wojiaodoubao: Introduced the Lance partitioning specification, enabling native support for partitioned datasets.
- @chenghao-guo: Implemented a Ray-based distributed IVF index builder, enabling scalable index construction.
- @nyl3532016: Added vector search support to `lance-spark`, enabling similarity search in Spark pipelines.
- @jiaoew1991: Built a fragment-aware join optimizer to improve Spark query performance on Lance datasets.
- @jtuglu1: Implemented distributed full-text search index creation in `lance-spark`.
- @bryanck: Improved stability of `lance-spark` by fixing Kryo serialization and classloader issues.
- @zhangyue19921010: Implemented Spark `MERGE INTO` support for upsert and delete operations on Lance tables.
We want to especially highlight the initial release of lance-context contributed by Uber.
A heartfelt thank you to our community contributors of Lance and LanceDB this past month:
@fzowl • @dcfocus • @ChunxuTang • @beinan • @jja725 • @ex172000 • @hushengquan • @fatelei • @ddupg • @Mesut-Doner • @amanharshx • @Angryrou • @youssef-tharwat • @leiyuou • @prrao87 • @fenfeng9 • @chyyran • @camilesing • @zhangyue19921010 • @touch-of-grey • @fredlarochelle • @LuciferYang • @lhoestq • @majin1102 • @yanghua • @wojiaodoubao • @lichuang • @Ke-Wang • @niebayes • @HaochengLIU • @markmcd • @chenghao-guo • @nyl3532016 • @jiaoew1991 • @jtuglu1 • @bryanck • @fangbo • @majian1998 • @hamersaw
Lance Community Sync Recap
In January, we held two Lance Community Syncs focused on the upcoming Lance 2.0.0 release (now at RC4 and approaching final community vote), growing ecosystem integrations with DuckDB, Polaris, and Hugging Face, and the formalization of lance-context and lance-graph as official sub-projects.
We also discussed recent performance work across Spark, vector indexing, and WAL/mem-table updates, alongside forward-looking proposals covering schema semantics, metadata visibility, clustering strategies, and a new Incubator governance stage for emerging projects.
The next Lance Community Sync will take place on Thursday, February 12, 2026.
- Subscribe to the Lance mailing list to receive the meeting invite
- Add discussion topics to the meeting notes
- Watch previous recordings: Jan 15 | Jan 29


