LanceDB Cloud and Enterprise provide performant full-text search based on BM25, allowing you to incorporate keyword-based search in your retrieval solutions.
The create_fts_index API returns immediately, but the FTS index is built asynchronously.
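For intuition about the BM25 ranking mentioned above, here is a minimal, self-contained sketch of textbook BM25 scoring over pre-tokenized documents. It is not LanceDB's implementation: the `k1` and `b` values are common defaults chosen for illustration, the IDF variant is one of several in use, and the function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with textbook BM25.

    k1 and b are illustrative defaults, not values confirmed for LanceDB.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "quick quick fox jumps over the lazy dog".split(),
]
scores = bm25_scores(["quick", "fox"], docs)
# The second document contains neither query term, so its score is 0.0.
```

Documents that contain more of the query terms, in fewer total tokens, tend to score higher; documents with no matching terms score zero.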
Creating FTS Indexes
import lancedb
# Connect to LanceDB
db = lancedb.connect(
    uri="db://your-project-slug",
    api_key="your-api-key",
    region="us-east-1",
)
# Open the table and create an FTS index on the "text" column
table_name = "lancedb-cloud-quickstart"
table = db.open_table(table_name)
table.create_fts_index("text")
import * as lancedb from "@lancedb/lancedb";
// Connect to LanceDB
const db = await lancedb.connect({
  uri: "db://your-project-slug",
  apiKey: "your-api-key",
  region: "us-east-1",
});
// Open the table and create an FTS index on the "text" column
const tableName = "lancedb-cloud-quickstart";
const table = await db.openTable(tableName);
await table.createIndex("text", {
  config: lancedb.Index.fts(),
});
Check the FTS index status using the methods above, or wait until the index is ready:
index_name = "text_idx"
table.wait_for_index([index_name])
const indexName = "text_idx"
await table.waitForIndex([indexName], 60)
Configuration Options
FTS Configuration Parameters
LanceDB supports the following configurable parameters for full-text search:
Parameter | Type | Default | Description |
---|---|---|---|
with_position | bool | False | Store token positions (required for phrase queries) |
base_tokenizer | str | “simple” | Text splitting method: “simple” (split by whitespace/punctuation), “whitespace” (split by whitespace only), or “raw” (treat the text as a single token) |
language | str | “English” | Language for tokenization (stemming/stop words) |
max_token_length | int | 40 | Maximum token size in bytes; tokens exceeding this length are omitted from the index |
lower_case | bool | True | Convert tokens to lowercase |
stem | bool | True | Apply stemming (e.g., “running” → “run”) |
remove_stop_words | bool | True | Remove common stop words |
ascii_folding | bool | True | Normalize accented characters |
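To make the effect of these parameters concrete, the sketch below is a toy tokenizer in plain Python that mimics the options in the table. It is an illustration only, not LanceDB's tokenizer: the stop-word list is a tiny made-up sample, stemming is left out, and the order in which the filters run is an assumption.

```python
import re
import unicodedata

# Tiny illustrative stop-word list; a real index uses a per-language list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def fold_ascii(token):
    """Strip diacritics by decomposing and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text, base_tokenizer="simple", max_token_length=40,
             lower_case=True, remove_stop_words=True, ascii_folding=True):
    """Toy pipeline mirroring the FTS parameters (stemming omitted)."""
    if base_tokenizer == "simple":
        tokens = re.findall(r"\w+", text)  # split on whitespace and punctuation
    elif base_tokenizer == "whitespace":
        tokens = text.split()              # split on whitespace only
    elif base_tokenizer == "raw":
        tokens = [text]                    # whole text is one token
    else:
        raise ValueError(f"unknown base_tokenizer: {base_tokenizer!r}")

    result = []
    for token in tokens:
        # max_token_length is measured in bytes, so encode before comparing.
        if len(token.encode("utf-8")) > max_token_length:
            continue
        if lower_case:
            token = token.lower()
        if ascii_folding:
            token = fold_ascii(token)
        if remove_stop_words and token in STOP_WORDS:
            continue
        result.append(token)
    return result


# A 64-character blob stands in for base64-like content that gets dropped.
print(tokenize("The Caf\u00e9 is open " + "A" * 64))              # ['cafe', 'open']
print(tokenize("Hello, World!", base_tokenizer="whitespace"))     # ['hello,', 'world!']
print(tokenize("Hello World", base_tokenizer="raw"))              # ['hello world']
```

Note how the “whitespace” tokenizer keeps punctuation attached to tokens, while “raw” yields a single token suited to exact-match fields.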
💡 Key Parameters
- The max_token_length parameter helps optimize indexing performance by filtering out non-linguistic content like base64 data and long URLs.
- When with_position is disabled, phrase queries will not work, but index size is reduced and indexing is faster.
- ascii_folding is useful for handling international text (e.g., “café” → “cafe”).
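The link between with_position and phrase queries can be seen in a toy positional inverted index. The sketch below is a simplified illustration, not LanceDB's index format, and the helper names `build_positional_index` and `phrase_match` are hypothetical. It shows that matching a phrase requires knowing where each token occurs, which is exactly the extra data that increases index size.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each token to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def phrase_match(index, phrase):
    """Return sorted ids of docs where the phrase tokens occur consecutively."""
    tokens = phrase.lower().split()
    if not tokens or any(t not in index for t in tokens):
        return []
    # Only documents containing every token can match.
    candidates = set(index[tokens[0]])
    for t in tokens[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        # Try each occurrence of the first token as the phrase start.
        for start in index[tokens[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(tokens)):
                hits.append(doc_id)
                break
    return hits


docs = ["new york city", "york is a new city", "the city of new york"]
index = build_positional_index(docs)

# Without positions, all three documents look identical for these two terms.
print(sorted(set(index["new"]) & set(index["york"])))  # [0, 1, 2]
# With positions, only documents with "new" directly before "york" match.
print(phrase_match(index, "new york"))                 # [0, 2]
```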
Phrase Query Configuration
To enable phrase queries, you must modify these parameters from their default values:
Parameter | Required Value | Purpose |
---|---|---|
with_position | True | Enables tracking of token positions for phrase matching |
remove_stop_words | False | Preserves all words, including stop words, for exact phrase matching |
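As a minimal sketch of applying the two values above, the helper below passes them to create_fts_index as keyword arguments. This assumes the Python SDK accepts the parameter names from the configuration table as keyword arguments; the helper name `create_phrase_fts_index` and the options constant are hypothetical, so check the signature against the SDK version you use.

```python
# Values taken from the phrase query table above.
PHRASE_QUERY_FTS_OPTIONS = {
    "with_position": True,       # track token positions for phrase matching
    "remove_stop_words": False,  # keep stop words so exact phrases stay intact
}


def create_phrase_fts_index(table, column="text"):
    """Create an FTS index on `column` configured for phrase queries.

    Assumes `table.create_fts_index` accepts the table's parameter names
    as keyword arguments.
    """
    table.create_fts_index(column, **PHRASE_QUERY_FTS_OPTIONS)
```

With a table opened as in the earlier example, this would be called as `create_phrase_fts_index(table)`. Index creation is still asynchronous, so wait for the index to be ready before issuing phrase queries.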