LanceDB WikiSearch: Native Full-Text Search on 41M Wikipedia Docs

It’s not about the vectors. It’s about getting the right result.

Many of our users are building RAG and search apps, and they want three things above all: precision, scale, and simplicity. In this article, we introduce WikiSearch, our flagship demo that delivers all three with minimal code.

WikiSearch is a simple search engine that stores and searches real Wikipedia entries. You don’t see it, but there is a lot of content sitting in LanceDB Cloud - and we use Full-Text Search to go through it. Vector search is still there for semantic relevance, and we merge both into a powerful Hybrid Search solution.

Scaling to 41 million documents in production presented significant engineering challenges. Here are the key performance breakthroughs we achieved:

  • Ingestion: 60,000+ documents per second with distributed GPU processing
  • Indexing: vector indexes built on 41M documents in just 30 minutes
  • Write bandwidth: 4 GB/s peak write rates sustained for real-time applications
💡 We previously used Tantivy for our search implementation. Now we have a native FTS solution built directly into LanceDB: it integrates more tightly, performs better for our use cases, and removes the external dependency.

Why Full-Text Search Helped

Full-Text Search (FTS) lets you find the exact words, phrases, and spellings people care about. It complements vector search by catching precise constraints, rare terms, and operators (phrases, boolean logic, field boosts) that embeddings alone often miss.

FTS works by tokenization: splitting text into small, searchable pieces called tokens. The tokenizer lowercases words, removes punctuation, and can reduce words to a base form (e.g., “running” → “run”).
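To make that concrete, here is a toy sketch of the idea in plain Python - it only lowercases and splits on punctuation, while LanceDB’s real tokenizers also handle stemming, stop words, and more:

python
import re

def toy_tokenize(text: str) -> list[str]:
    # Lowercase, strip punctuation, and split into tokens.
    # A real FTS tokenizer would additionally stem each token
    # (e.g., "running" -> "run") and can drop stop words.
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

print(toy_tokenize("Running through New-York!"))
# -> ['running', 'through', 'new', 'york']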

Here is how basic stemming is enabled for an English-language text:

python
table.create_fts_index("text", language="English", replace=True)

This call builds the tokens and stores them in an inverted (FTS) index. The tokenizer you choose can be standard, language-aware, n-gram, and more. Configuration directly shapes recall and precision, so you have plenty of freedom to tune the parameters to your use case.
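For instance, here is a sketch of a more hands-on configuration; the option names follow the Python SDK’s create_fts_index for native FTS and may differ slightly across versions, so check the FTS docs for your release:

python
table.create_fts_index(
    "text",
    language="English",      # language-aware stemming rules
    stem=True,               # "running" -> "run"
    remove_stop_words=True,  # drop "the", "and", ...
    lower_case=True,         # case-insensitive matching
    with_position=True,      # keep token positions so phrase queries work
    replace=True,
)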

FTS handles multilingual text, too. For French, enable ascii_folding during index creation to strip accents (e.g., “é” → “e”), so queries match words regardless of diacritics.

python
table.create_fts_index(
    "text",
    language="French",
    stem=True,
    ascii_folding=True,
    replace=True,
)
💡 Now, when you search for the name René, you can type it without the accent and still get a match!
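A quick sketch of that query with the Python SDK (the result columns depend on your schema):

python
# "Rene" matches documents containing "René" because ascii_folding
# normalized the accents at index time.
results = table.search("Rene", query_type="fts").limit(5).to_pandas()
print(results.head())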

FTS is especially important for an encyclopedia or a wiki, where articles are long and packed with names and multi-word terms. Tokenization makes variants like “New York City” and “New-York” findable and enables phrase and prefix matches, while exact keyword matching surfaces abbreviations like “NYC” wherever they appear in the text. The result is fast, precise lookup across millions of entries.

FTS is a great way to control search outcomes, and it makes vector search better and faster. Here’s how:

  1. During hybrid search, FTS and vector search run in parallel, each finding their own candidate pools.
  2. FTS finds documents with exact term matches, while vector search finds semantically similar content.
  3. These results are then combined and reranked using techniques like Reciprocal Rank Fusion (RRF) or weighted scoring, giving you the best of both approaches - precise keyword matching and semantic understanding.

You can often find what embeddings miss, such as rare terms, names, numbers, and words with “must include/exclude” rules. Most of all, you can combine keyword scores with vector scores to rank by both meaning and exact wording, and show highlights to explain why a result matched.

In LanceDB’s Hybrid Search, native FTS blends text and vector signals with weights or via Reciprocal Rank Fusion (RRF) for a fully reranked search solution.
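Here is a minimal sketch of that flow in the Python SDK; it assumes the table already has an FTS index, a vector index, and an embedding function configured so the query string is embedded automatically:

python
from lancedb.rerankers import RRFReranker

# FTS and vector search each build a candidate pool, then RRF fuses
# the two ranked lists into a single result set.
results = (
    table.search("first person to walk on the moon", query_type="hybrid")
    .rerank(reranker=RRFReranker())
    .limit(10)
    .to_pandas()
)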

💡 To learn more about Hybrid Search, give this example a try.

The 41M WikiSearch Demo

Wikipedia Search Demo

💡 Want to know who wrote Romeo and Juliet? Give it a spin!

The demo lets you switch between semantic (vector), full-text (keyword), and hybrid search modes. Semantic or Vector Search finds conceptually related content, even when the exact words differ. Full-text Search excels at finding precise terms and phrases. Hybrid Search combines both approaches - getting the best of semantic understanding while still catching exact matches. Try comparing the different modes to see how they handle various queries.
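To reproduce that comparison in code, here is a minimal sketch; it assumes the table has both indexes plus a configured embedding function, so the same plain-text query works in every mode:

python
query = "who wrote Romeo and Juliet"

# Keyword-only: exact term matching against the FTS index
fts_hits = table.search(query, query_type="fts").limit(5).to_pandas()

# Semantic-only: nearest neighbors in embedding space
vector_hits = table.search(query, query_type="vector").limit(5).to_pandas()

# Hybrid: both candidate pools, fused and reranked
hybrid_hits = table.search(query, query_type="hybrid").limit(5).to_pandas()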

Behind the Scenes

Step 1: Ingestion

We start with raw articles from Wikipedia and normalize content into pages and sections. Long articles are chunked on headings so each result points to a focused span of text rather than an entire page.

During ingestion we create a schema with columns such as content, url, and title. Writes are batched (≈200k rows per commit) to maximize throughput.
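A stripped-down sketch of that setup looks roughly like this; the connection string, column names, and vector dimension are placeholders you would adapt to your own pipeline:

python
import lancedb
import pyarrow as pa

# Connect to LanceDB Cloud (placeholders for your project and key)
db = lancedb.connect("db://your-project", api_key="your-api-key", region="us-east-1")

# Schema mirroring the demo's columns; 384 matches the example
# sentence-transformers model used in the embedding step.
schema = pa.schema([
    pa.field("title", pa.string()),
    pa.field("url", pa.string()),
    pa.field("content", pa.string()),
    pa.field("vector", pa.list_(pa.float32(), 384)),
])

table = db.create_table("wikipedia", schema=schema)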

Figure 1: Data is ingested, embedded, and stored in LanceDB. The user runs queries and retrieves WikiSearch results via our Python SDK.

Step 2: Embedding

A parallel embedding pipeline (configurable model) writes vectors into the vector column. The demo scripts let you swap the embedding models easily. Here, we are using a basic sentence-transformers model.
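Here is a rough sketch of one embedding worker; the model name and helper are illustrative rather than the exact demo code:

python
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example model, 384 dims

def embed_batch(chunks: list[dict]) -> list[dict]:
    # Encode a batch of chunk texts and attach the vectors before writing.
    vectors = encoder.encode([c["content"] for c in chunks], batch_size=256)
    for chunk, vec in zip(chunks, vectors):
        chunk["vector"] = vec.tolist()
    return chunks

table.add(embed_batch(batch_of_chunks))  # batch_of_chunks: your chunked pages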

To learn more about vectorization, read our Embedding API docs.

Step 3: Indexing

We build two indexes per table: a vector index (IVF_HNSW_PQ or IVF_PQ, depending on your latency/recall/memory goals) over the embedded content, and a native FTS index over title and body.

This is where you define tokenization and matching options. As you configure the FTS index, you control how the Wikipedia text is broken down into tokens.
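In the Python SDK, that step looks roughly like this; index parameters and defaults vary by SDK version and deployment, so treat it as a sketch:

python
# Vector index over the embedding column; switch index_type to
# "IVF_HNSW_PQ" if you want higher recall at a larger memory footprint.
table.create_index(
    metric="cosine",
    vector_column_name="vector",
    index_type="IVF_PQ",
)

# Native FTS indexes over the text columns
table.create_fts_index("title", replace=True)
table.create_fts_index("content", replace=True)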

Figure 2: Sample LanceDB Cloud table with schema and defined indexes for each column.

Step 4: Service

A thin API fronts LanceDB Cloud. The web UI issues text, vector, or hybrid queries, shows results, and exposes explain_plan for each request. Deploying the app takes a connection string plus credentials - that’s it! Check out the entire implementation on GitHub.

How the Search Works

  1. A text query first hits the FTS index and returns a pool of candidate document IDs with scores derived from term statistics.

  2. A semantic query embeds the input and asks the vector index for nearest neighbors, producing a separate candidate pool with distances.

  3. In hybrid mode we normalize these signals and combine them into a reranked search result.

Trying out the search function reveals a lot about the nature of each mode. Semantic Search takes meaning and context into account, with less direct precision, while Full-Text Search matches the exact keywords you typed.

Figure 3: Semantic search is able to detect that a cosmonaut is also an astronaut.

Search Parameters

The search interface gives you full visibility into how your queries are processed. You can see the exact search terms being used, which fields are being searched (title, content, or both), and the scoring weights applied to different components.

This transparency helps you understand why certain results ranked higher and allows you to fine-tune your search strategy.

Figure 4: Behind the scenes, you can see all the Search Parameters for your query.

The Query Plan

Now we’re getting serious. explain_plan is a valuable feature we created to help debug search issues and optimize performance. Toggle it to get a structured trace of how LanceDB executed your query.
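The same trace is available programmatically; here is a sketch using explain_plan on a query builder in the Python SDK (flags and output format may vary by version):

python
# Fetch the execution plan for a full-text query.
plan = (
    table.search("apollo 11", query_type="fts")
    .limit(10)
    .explain_plan(verbose=True)
)
print(plan)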

Figure 5: The Query Plan can be shown for Semantic & Full-Text Search. Hybrid Search will be added soon, with a detailed outline of the reranker and its effect.

The Query Plan shows:

  • Which indexes were used (FTS and/or vector) and with what parameters
  • Candidate counts from each stage (text and vector), plus the final returned set
  • Filters that were applied early vs. at re-rank
  • Timings per stage so you know where to optimize

Performance and Scaling

At ~41 million documents, we needed to add data in batches. We ingested it efficiently using table.add(), 200K rows per batch:

python
BATCH_SIZE_LANCEDB = 200_000

# Write the processed chunks in fixed-size batches
for i in range(0, len(all_processed_chunks), BATCH_SIZE_LANCEDB):
    batch_to_add = all_processed_chunks[i : i + BATCH_SIZE_LANCEDB]
    try:
        table.add(batch_to_add)
    except Exception as e:
        print(f"Error adding batch to LanceDB: {e}")
💡 Batching table.add(list_of_dicts) is much faster than adding records individually. Adjust BATCH_SIZE_LANCEDB based on memory and performance.

Performance at Scale

The core pattern is: parallelize data loading, chunking, and embedding generation, then use table.add(batch) within each parallel worker to write to LanceDB. LanceDB’s design handles these concurrent additions efficiently. Our demo uses Modal for distributed embedding generation and ingestion.
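If you are not running on a distributed platform, the same pattern works on a single machine; here is a sketch with concurrent workers, where embed_batch and all_processed_chunks stand in for your own pipeline:

python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 200_000

def ingest_slice(start: int) -> None:
    # Each worker embeds its slice of chunks and writes it with table.add()
    batch = all_processed_chunks[start : start + BATCH_SIZE]
    table.add(embed_batch(batch))

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(ingest_slice, range(0, len(all_processed_chunks), BATCH_SIZE)))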

Here are some performance metrics from our side. These numbers represent enterprise-grade performance at massive scale:

  • Ingestion: Using a distributed setup with 50 GPUs (via Modal), we ingested ~41M rows in roughly 11 minutes end-to-end - over 60,000 documents per second.
  • Indexing: The vector index build completed in about 30 minutes for the same dataset; with other solutions, building vector indexes on 41M documents typically takes hours.
  • Write bandwidth: LanceDB’s ingestion layer can sustain multi-GB/s write rates (4 GB/s peak observed in our tests) when batching and parallelism are configured properly, enabling real-time data ingestion for live applications.

Which Index to Use?

Use IVF_HNSW_PQ for high recall and predictable latency; use IVF_PQ when memory footprint is the constraint and you want excellent throughput at scale. Native FTS indexes (title, body) handle tokenization and matching; tune their options to your corpus.

Trying Things Out

Your numbers will vary based on encoder speed, instance types, and network, but the pattern holds: parallelize embedding, batch writes (e.g., ~200k rows per batch), and build indexes once per checkpointed snapshot. LanceDB Cloud keeps the table readable while background jobs run, so you can stage changes and cut over via an alias swap without downtime.

💡 Full Code and Tutorial
Check out the GitHub Repository.

The Search is Never Complete

Beyond the endless exploration of a large dataset, this demo showcases what’s possible when you combine LanceDB’s native Full-Text Search with vector embeddings.

You get the precision of keyword matching, the semantic understanding of embeddings, and the scalability to handle massive datasets - all in one unified platform.

We built this entire app on LanceDB Cloud, which is free to try and comes with comprehensive tutorials, sample code, and documentation to help you build RAG applications, AI agents, and semantic search engines.

Wikipedia Search Demo

David Myriel

Writer.

Ayush Chaurasia

ML Engineer and researcher focused on multi-modal AI systems and efficient retrieval methods.