LanceDB provides two powerful tools for query analysis and optimization: `explain_plan` and `analyze_plan`. Let’s take a closer look at how they work:
| Method | Purpose | Description |
|---|---|---|
| `explain_plan` | Query Analysis | Prints the resolved query plan to understand how the query will be executed. Helpful for identifying slow queries or unexpected query results. |
| `analyze_plan` | Performance Tuning | Executes the query and returns a physical execution plan annotated with runtime metrics, including execution time, number of rows processed, and I/O stats. Essential for performance tuning and debugging. |
`explain_plan` reveals the logical query plan before execution, helping you identify potential issues with query structure and index usage.

`analyze_plan` executes the query and provides detailed runtime metrics, including:

- `elapsed_compute`
- `output_rows`
- `bytes_read`
- `index_comparisons`
- `indices_loaded`
- `iops`
- `requests`

Together, these tools offer a comprehensive view of query performance, from planning to execution. Use `explain_plan` to verify your query structure and `analyze_plan` to measure and optimize actual performance.
To demonstrate query performance analysis, we’ll use a table containing 1.2M rows sampled from the Wikipedia dataset. Initially, the table has no indices, allowing us to observe the impact of optimization.
Let’s examine a vector search query that:

- filters rows where `identifier` is between 0 and 1,000,000
- selects the `chunk_index`, `title`, and `identifier` columns

```python
# explain_plan
query_explain_plan = (
    table.search(query_embed)
    .where("identifier > 0 AND identifier < 1000000")
    .select(["chunk_index", "title", "identifier"])
    .limit(100)
    .explain_plan(True)
)
```

```typescript
// explain_plan
const explainPlan = await table
    .search(queryEmbed)
    .where("identifier > 0 AND identifier < 1000000")
    .select(["chunk_index", "title", "identifier"])
    .limit(100)
    .explainPlan(true);
```

The execution plan reveals the sequence of operations performed to execute your query. Let’s examine each component:
```
ProjectionExec: expr=[chunk_index@4 as chunk_index, title@5 as title, identifier@1 as identifier, _distance@3 as _distance]
  RemoteTake: columns="vector, identifier, _rowid, _distance, chunk_index, title"
    CoalesceBatchesExec: target_batch_size=1024
      GlobalLimitExec: skip=0, fetch=100
        FilterExec: _distance@3 IS NOT NULL
          SortExec: TopK(fetch=100), expr=[_distance@3 ASC NULLS LAST], preserve_partitioning=[false]
            KNNVectorDistance: metric=l2
              FilterExec: identifier@1 > 0 AND identifier@1 < 1000000
                LanceScan: uri=***, projection=[vector, identifier], row_id=true, row_addr=false, ordered=false
```

Reading from the bottom up:

- `LanceScan` (`projection=[vector, identifier]`, `row_id=true`, `row_addr=false`, `ordered=false`): scans the table, reading the `vector` column and the `identifier` column needed by the filter.
- `FilterExec: identifier@1 > 0 AND identifier@1 < 1000000`: applies the range predicate to every scanned row.
- `KNNVectorDistance: metric=l2`: computes the L2 distance between the query vector and each remaining row.
- `SortExec: TopK(fetch=100)` (`expr=[_distance@3 ASC NULLS LAST]`, `preserve_partitioning=[false]`): keeps only the 100 nearest results.
- `FilterExec: _distance@3 IS NOT NULL`: discards rows with no computed distance.
- `GlobalLimitExec: skip=0, fetch=100`: enforces the 100-row limit.
- `CoalesceBatchesExec: target_batch_size=1024`: merges small batches for efficient downstream processing.
- `RemoteTake` (`columns="vector, identifier, _rowid, _distance, chunk_index, title"`): fetches the remaining columns for the result rows; `RemoteTake` is a key component of Lance’s I/O cache.
- `ProjectionExec`: emits the final output columns (`chunk_index`, `title`, `identifier`, `_distance`).

This plan demonstrates a basic search without index optimizations: it performs a full scan and filter before the vector search.
Let’s use `analyze_plan` to execute the query and gather runtime metrics, which will help us identify potential bottlenecks:
```python
# analyze_plan
query_analyze_plan = (
    table.search(query_embed)
    .where("identifier > 0 AND identifier < 1000000")
    .select(["chunk_index", "title", "identifier"])
    .limit(100)
    .analyze_plan()
)
```

```typescript
// analyze_plan
const analyzePlan = await table
    .search(queryEmbed)
    .where("identifier > 0 AND identifier < 1000000")
    .select(["chunk_index", "title", "identifier"])
    .limit(100)
    .analyzePlan();
```

```
ProjectionExec: expr=[chunk_index@4 as chunk_index, title@5 as title, identifier@1 as identifier, _distance@3 as _distance], metrics=[output_rows=100, elapsed_compute=1.424µs]
  RemoteTake: columns="vector, identifier, _rowid, _distance, chunk_index, title", metrics=[output_rows=100, elapsed_compute=175.53097ms, output_batches=1, remote_takes=100]
    CoalesceBatchesExec: target_batch_size=1024, metrics=[output_rows=100, elapsed_compute=2.748µs]
      GlobalLimitExec: skip=0, fetch=100, metrics=[output_rows=100, elapsed_compute=1.819µs]
        FilterExec: _distance@3 IS NOT NULL, metrics=[output_rows=100, elapsed_compute=10.275µs]
          SortExec: TopK(fetch=100), expr=[_distance@3 ASC NULLS LAST], preserve_partitioning=[false], metrics=[output_rows=100, elapsed_compute=39.259451ms, row_replacements=546]
            KNNVectorDistance: metric=l2, metrics=[output_rows=1099508, elapsed_compute=56.783526ms, output_batches=1076]
              FilterExec: identifier@1 > 0 AND identifier@1 < 1000000, metrics=[output_rows=1099508, elapsed_compute=17.136819ms]
                LanceScan: uri=***, projection=[vector, identifier], row_id=true, row_addr=false, ordered=false, metrics=[output_rows=1200000, elapsed_compute=21.348178ms, bytes_read=1852931072, iops=78, requests=78]
```

Two metrics stand out:

- `LanceScan` reads all 1,200,000 rows (columns `vector` and `identifier`), totaling about 1.85 GB (`bytes_read=1852931072`) over 78 I/O requests.
- `FilterExec` (`identifier > 0 AND identifier < 1000000`) still emits 1,099,508 rows, so `KNNVectorDistance` must compute distances for more than a million vectors.

After creating vector and scalar indices, the execution plan shows:
```
ProjectionExec: expr=[chunk_index@3 as chunk_index, title@4 as title, identifier@2 as identifier, _distance@0 as _distance]
  RemoteTake: columns="_distance, _rowid, identifier, chunk_index, title"
    CoalesceBatchesExec: target_batch_size=1024
      GlobalLimitExec: skip=0, fetch=100
        SortExec: TopK(fetch=100), expr=[_distance@0 ASC NULLS LAST], preserve_partitioning=[false]
          ANNSubIndex: name=vector_idx, k=100, deltas=1
            ANNIvfPartition: uuid=83916fd5-fc45-4977-bad9-1f0737539bb9, nprobes=20, deltas=1
              ScalarIndexQuery: query=AND(identifier > 0,identifier < 1000000)
```

Running `analyze_plan` on the indexed table reports per-node metrics:

```
ScalarIndexQuery: query=AND(identifier > 0,identifier < 1000000)
  metrics=[
    output_rows=2
    index_comparisons=2,301,824
    indices_loaded=2
    output_batches=1
    parts_loaded=562
    elapsed_compute=86.979354ms
  ]

ANNSubIndex: name=vector_idx, k=100, deltas=1
  metrics=[
    output_rows=2,000
    index_comparisons=25,893
    indices_loaded=0
    output_batches=20
    parts_loaded=20
    elapsed_compute=111.849043ms
  ]

RemoteTake: columns="_distance, _rowid, identifier, chunk_index, title"
  metrics=[output_rows=100, elapsed_compute=113.491859ms, output_batches=1, remote_takes=100]
```

The key metrics to watch:

- `ScalarIndexQuery`: `indices_loaded=2`, `parts_loaded=562`, `output_batches=1` — the range filter is answered entirely from the scalar index, with no table scan.
- `ANNSubIndex`: `index_comparisons=25,893`, `indices_loaded=0`, `parts_loaded=20`, `output_batches=20` — the vector index narrows the search to roughly 26K comparisons instead of over a million distance computations.
- `RemoteTake`: `remote_takes=100`, `output_batches=1` — only the 100 result rows are fetched from storage.

As a general guide, match the index type to the column’s data type:

| Data Type | Recommended Index | Use Case |
|---|---|---|
| Vector | IVF_PQ/IVF_HNSW_SQ | Approximate nearest neighbor search |
| Scalar | B-Tree | Range queries and sorting |
| Categorical | Bitmap | Multi-value filters and set operations |
| List | Label_list | Multi-label classification and filtering |
!!! info "Index Coverage Monitoring"
    Use `table.index_stats()` to monitor index coverage.
    A well-optimized table should have `num_unindexed_rows` ~ 0.
| Plan Pattern | Optimization |
|---|---|
| `LanceScan` with high `bytes_read` or `iops` | Add missing index. Use `select()` to limit returned columns. Check whether the dataset has been compacted. |
| Multiple sequential filters | Reorder filter conditions |
!!! note "Regular Performance Analysis"
    Regularly analyze your query plans to identify and address performance bottlenecks.
    The `analyze_plan` output provides detailed metrics to guide optimization efforts.
For vector search performance, pay particular attention to the ANN-related entries in the plan, such as `nprobes` and `index_comparisons`.
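As a starting point for automating that analysis, per-node metrics can be pulled out of the `analyze_plan` text with a small parser. This is a sketch that assumes the `metrics=[key=value, ...]` line format shown above; it skips metric values that themselves contain commas (e.g. `2,301,824`):

```python
import re

def parse_plan_metrics(plan_text: str) -> list[dict]:
    """Extract per-node metrics from an analyze_plan-style report.

    Assumes lines shaped like:
        NodeName: ..., metrics=[key=value, key=value]
    Values containing commas are not handled by this simple split.
    """
    nodes = []
    for line in plan_text.splitlines():
        m = re.search(r"^\s*(\w+):.*metrics=\[([^\]]*)\]", line)
        if not m:
            continue
        name, raw = m.group(1), m.group(2)
        metrics = {}
        for pair in raw.split(","):
            if "=" in pair:
                key, value = pair.split("=", 1)
                metrics[key.strip()] = value.strip()
        nodes.append({"node": name, **metrics})
    return nodes

sample = (
    "LanceScan: uri=***, projection=[vector, identifier], "
    "metrics=[output_rows=1200000, bytes_read=1852931072, iops=78]"
)
print(parse_plan_metrics(sample))
```

From here, thresholds (say, flagging any node whose `bytes_read` exceeds a budget) can turn plan analysis into an automated regression check.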