It’s not about the vectors. It’s about getting the right result.
Many of our users are building RAG and search apps, and they want three things above all: precision, scale, and simplicity. In this article, we introduce WikiSearch, our flagship demo that delivers all three with minimal code.
WikiSearch is a very simple search engine that stores and searches through real Wikipedia entries. You don't see it, but there is a lot of content sitting in LanceDB Cloud - and we use Full-Text Search to go through it. Vector search is still there for semantic relevance, and we merge both into a powerful Hybrid Search solution.
Scaling to 41 million documents in production presented significant engineering challenges. Here are the key performance breakthroughs we achieved:
Metric | Performance |
---|---|
Ingestion | We processed 60,000+ documents per second with distributed GPU processing |
Indexing | We built vector indexes on 41M documents in just 30 minutes |
Write Bandwidth | We sustained 4 GB/s peak write rates for real-time applications |
Why Full-Text Search Helped
Full-Text Search (FTS) lets you find the exact words, phrases, and spellings people care about. It complements vector search by catching precise constraints, rare terms, and operators (phrases, boolean logic, field boosts) that embeddings alone often miss.
It works by tokenization: splitting text into small, searchable pieces called tokens. It lowercases words, removes punctuation, and can reduce words to a base form (e.g., “running” → “run”).
Here is how basic stemming is enabled for an English-language text:
table.create_fts_index("text", language="English", replace=True)
This call builds an inverted (FTS) index and stores the tokens in it. The tokenizer you choose can be standard, language-aware, or n-gram based, among other options. Configuration directly shapes recall and precision, so you have plenty of freedom to tune the parameters to your use case.
FTS handles multilingual text, too. For French, enable `ascii_folding` during index creation to strip accents (e.g., "é" → "e"), so queries match words regardless of diacritics.
table.create_fts_index(
    "text",
    language="French",
    stem=True,
    ascii_folding=True,
    replace=True,
)
FTS is especially important for an encyclopedia or a wiki, where articles are long and packed with names and multi-word terms. Tokenization makes variants like "New York City," "New-York," and "NYC" findable, and enables phrase/prefix matches. The result is fast, precise lookup across millions of entries.
FTS and Hybrid Search
FTS is a great way to control search outcomes and makes vector search better and faster. Here's how:
- During hybrid search, FTS and vector search run in parallel, each finding their own candidate pools.
- FTS finds documents with exact term matches, while vector search finds semantically similar content.
- These results are then combined and reranked using techniques like Reciprocal Rank Fusion (RRF) or weighted scoring, giving you the best of both approaches - precise keyword matching and semantic understanding.
You can often find what embeddings miss, such as rare terms, names, numbers, and words with "must include/exclude" rules. Most of all, you can combine keyword scores with vector scores to rank by both meaning and exact wording, and show highlights to explain why a result matched.
In LanceDB's Hybrid Search, native FTS blends text and vector signals with weights or via RRF to produce a fully reranked result set.
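To make this concrete, here is a minimal sketch of a hybrid query through the Python SDK. It assumes the table was created with a registered embedding function (so the query string is embedded automatically); the query text and result limit are illustrative:

from lancedb.rerankers import RRFReranker

# FTS and vector search run in parallel; RRF merges the two candidate pools
# into a single ranked list.
reranker = RRFReranker()

results = (
    table.search("ancient roman aqueducts", query_type="hybrid")
    .rerank(reranker=reranker)
    .limit(10)
    .to_pandas()
)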
The 41M WikiSearch Demo
The demo lets you switch between semantic (vector), full-text (keyword), and hybrid search modes. Semantic or Vector Search finds conceptually related content, even when the exact words differ. Full-text Search excels at finding precise terms and phrases. Hybrid Search combines both approaches - getting the best of semantic understanding while still catching exact matches. Try comparing the different modes to see how they handle various queries.
Behind the Scenes
Step 1: Ingestion
We start with raw articles from Wikipedia and normalize content into pages and sections. Long articles are chunked on headings so each result points to a focused span of text rather than an entire page.
During ingestion we create a schema and columns, such as `content`, `url`, and `title`. Writes are batched (≈200k rows per commit) to maximize throughput.
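As a rough sketch of this step (the database name, API key, region, and vector dimension below are placeholders, and the schema is simplified), table creation can look like this:

import lancedb
import pyarrow as pa

# Connect to LanceDB Cloud; the URI, key, and region are placeholders.
db = lancedb.connect(
    uri="db://your-database",
    api_key="your-api-key",
    region="us-east-1",
)

# A simplified WikiSearch schema: one row per chunked section.
schema = pa.schema([
    pa.field("title", pa.string()),
    pa.field("url", pa.string()),
    pa.field("content", pa.string()),
    # Fixed-size vector column; the dimension depends on the embedding model.
    pa.field("vector", pa.list_(pa.float32(), 384)),
])

table = db.create_table("wikipedia", schema=schema)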
Figure 1: Data is ingested, embedded, and stored in LanceDB. The user runs queries and retrieves WikiSearch results via our Python SDK.
Step 2: Embedding
A parallel embedding pipeline (configurable model) writes vectors into the `vector` column. The demo scripts let you swap embedding models easily. Here, we are using a basic `sentence-transformers` model.
To learn more about vectorization, read our Embedding API docs.
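As an illustration of that pipeline (the model name and batch size are assumptions, not the demo's exact configuration), each worker attaches a vector to every chunk before writing:

from sentence_transformers import SentenceTransformer

# Illustrative encoder; the demo scripts let you swap in any model.
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_chunks(chunks):
    # Encode each chunk's text and attach it as the "vector" field.
    texts = [c["content"] for c in chunks]
    vectors = model.encode(texts, batch_size=256, show_progress_bar=False)
    for chunk, vec in zip(chunks, vectors):
        chunk["vector"] = vec.tolist()
    return chunks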
Step 3: Indexing
We build two indexes per table: a vector index (`IVF_HNSW_PQ` or `IVF_PQ`, depending on your latency/recall/memory goals) over the embedded content, and a native FTS index over title and body.
This is where you define tokenization and matching options. As you configure the FTS index, you control how the Wikipedia text is broken into tokens and matched.
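Here is a sketch of both index builds, with illustrative parameter values (tune partitions and sub-vectors for your latency, recall, and memory targets; exact arguments may differ slightly by SDK version):

# Vector index over the embedded content.
table.create_index(
    metric="cosine",
    index_type="IVF_PQ",
    num_partitions=1024,
    num_sub_vectors=96,
)

# Native FTS index over the text fields, with the tokenization options
# discussed above.
table.create_fts_index(
    ["title", "content"],
    language="English",
    stem=True,
    ascii_folding=True,
    replace=True,
)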
Figure 2: Sample LanceDB Cloud table with schema and defined indexes for each column.
Step 4: Service
A thin API fronts LanceDB Cloud. The web UI issues text, vector, or hybrid queries, shows results, and exposes `explain_plan` for each request. Deploying the app takes a connection string plus credentials, and that's it!
Check out the entire implementation on GitHub.
How the Search Works
- A text query first hits the FTS index and returns a pool of candidate document IDs with scores derived from term statistics.
- A semantic query embeds the input and asks the vector index for nearest neighbors, producing a separate candidate pool with distances.
- In hybrid mode, we normalize these signals and combine them into a reranked search result (see the sketch below).
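To illustrate the fusion step, here is the standard RRF formula in plain Python. This is only a sketch of the idea, not LanceDB's internal implementation:

def rrf_fuse(fts_ids, vector_ids, k=60):
    # Each document scores 1 / (k + rank) in every list it appears in;
    # the summed score determines the final order.
    scores = {}
    for ranked in (fts_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A document that ranks well in both pools floats to the top.
print(rrf_fuse(["a", "b", "c"], ["c", "a", "d"]))  # ['a', 'c', 'b', 'd']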
Trying out the search function will reveal a lot about the nature of each search. Semantic Search weighs meaning and context, with less direct precision, while Full-Text Search looks for the exact keywords in your query.
Figure 3: Semantic search is able to detect that a cosmonaut is also an astronaut.
Search Parameters
The search interface gives you full visibility into how your queries are processed. You can see the exact search terms being used, which fields are being searched (title, content, or both), and the scoring weights applied to different components.
This transparency helps you understand why certain results ranked higher and allows you to fine-tune your search strategy.
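If you want to set those weights yourself from code, LanceDB's rerankers expose them directly. A minimal sketch with a weighted reranker (the 0.7 weight and the query are illustrative, and we again assume a registered embedding function):

from lancedb.rerankers import LinearCombinationReranker

# `weight` is the share given to the vector score; the remainder goes to FTS.
reranker = LinearCombinationReranker(weight=0.7)

results = (
    table.search("first woman in space", query_type="hybrid")
    .rerank(reranker=reranker)
    .limit(10)
    .to_pandas()
)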
Figure 4: Behind the scenes, you can see all the Search Parameters for your query.
The Query Plan
Now we're getting serious. `explain_plan` is a very valuable feature that we created to help debug search issues and optimize performance. Toggle it to get a structured trace of how LanceDB executed your query.
Figure 5: The Query Plan can be shown for Semantic and Full-Text Search. Hybrid Search will be added soon, with a detailed outline of the reranker and its effect.
The Query Plan shows:
- Which indexes were used (FTS and/or vector) and with what parameters
- Candidate counts from each stage (text and vector), plus the final returned set
- Which filters were applied early vs. at rerank time
- Timings per stage so you know where to optimize
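From the Python SDK, the same trace is available on the query builder. A small sketch (the query is illustrative, and we assume an embedding function is registered so the string is embedded for vector search):

# Ask LanceDB how it plans to execute the query before running it.
plan = (
    table.search("apollo 11 landing site", query_type="vector")
    .limit(10)
    .explain_plan(verbose=True)
)
print(plan)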
Performance and Scaling
At ~41 million documents, we needed to add data in batches. We ingested data efficiently using `table.add()`, with batches of 200K rows at a time:
BATCH_SIZE_LANCEDB = 200_000

for i in range(0, len(all_processed_chunks), BATCH_SIZE_LANCEDB):
    batch_to_add = all_processed_chunks[i : i + BATCH_SIZE_LANCEDB]
    try:
        table.add(batch_to_add)
    except Exception as e:
        print(f"Error adding batch to LanceDB: {e}")
`table.add(list_of_dicts)` is much faster than adding records individually. Adjust `BATCH_SIZE_LANCEDB` based on memory and performance.
Performance at Scale
The core pattern is: parallelize data loading, chunking, and embedding generation, then use `table.add(batch)` within each parallel worker to write to LanceDB. LanceDB's design efficiently handles these concurrent additions. This example uses Modal for distributed embedding generation and ingestion.
Here are some performance metrics from our runs. These numbers represent enterprise-grade performance at massive scale:
Process | Performance |
---|---|
Ingestion | Using a distributed setup with 50 GPUs (via Modal), we ingested ~41M rows in roughly 11 minutes end-to-end. This translates to processing over 60,000 documents per second. |
Indexing | Vector index build completed in about 30 minutes for the same dataset. Building vector indexes on 41M documents typically takes hours with other solutions. |
Write bandwidth | LanceDB's ingestion layer can sustain multi-GB/s write rates (4 GB/s peak observed in our tests) when batching and parallelism are configured properly. This enables real-time data ingestion for live applications. |
Which Index to Use?
Use `IVF_HNSW_PQ` for high recall and predictable latency; use `IVF_PQ` when memory footprint is the constraint and you want excellent throughput at scale. Native FTS indexes (title, body) handle tokenization and matching; choose options per your corpus.
Trying Things Out
Your numbers will vary based on encoder speed, instance types, and network, but the pattern holds: parallelize embedding, batch writes (e.g., ~200k rows per batch), and build indexes once per checkpointed snapshot. LanceDB Cloud keeps the table readable while background jobs run, so you can stage changes and cut over via alias swap without downtime.
The Search is Never Complete
Beyond the endless exploration of a large dataset, this demo showcases what's possible when you combine LanceDB's native Full-Text Search with vector embeddings.
You get the precision of keyword matching, the semantic understanding of embeddings, and the scalability to handle massive datasets - all in one unified platform.
We built this entire app on LanceDB Cloud, which is free to try and comes with comprehensive tutorials, sample code, and documentation to help you build RAG applications, AI agents, and semantic search engines.