# Getting Started with LanceDB: Basic Usage

In this section, you'll learn basic operations in the Python, TypeScript, and Rust SDKs. For the LanceDB Cloud/Enterprise API Reference, see the HTTP REST API Specification.

## Installation Options

**Python**

```shell
pip install lancedb
```

**TypeScript**

```shell
npm install @lancedb/lancedb
```

**Bundling `@lancedb/lancedb` apps with Webpack:** Since LanceDB contains a prebuilt Node binary, you must configure `next.config.js` to exclude it from webpack. This is required both when using Next.js and when deploying a LanceDB app on Vercel.

```javascript
/** @type {import('next').NextConfig} */
module.exports = {
  webpack(config) {
    config.externals.push({ '@lancedb/lancedb': '@lancedb/lancedb' });
    return config;
  },
};
```

**Yarn users:** Unlike other package managers, Yarn does not automatically resolve peer dependencies. If you are using Yarn, you will need to manually install `apache-arrow`:

```shell
yarn add apache-arrow
```

**Rust**

```shell
cargo add lancedb
```

To use the `lancedb` crate, you first need to install protobuf.

macOS:

```shell
brew install protobuf
```

Ubuntu/Debian:

```shell
sudo apt install -y protobuf-compiler libssl-dev
```

Please also make sure you're using the same version of Arrow as the `lancedb` crate.

### Preview Releases

Stable releases are created about every 2 weeks. For the latest features and bug fixes, you can install a preview release. These releases receive the same level of testing as stable releases, but are not guaranteed to remain available for more than 6 months after they are released. Once your application is stable, we recommend switching to stable releases.

**Python**

```shell
pip install --pre --extra-index-url https://pypi.fury.io/lancedb/ lancedb
```

**TypeScript**

```shell
npm install @lancedb/lancedb@preview
```

**Rust**

We don't push preview releases to crates.io, but you can reference the tag on GitHub in your Cargo dependencies:

```toml
[dependencies]
lancedb = { git = "https://github.com/lancedb/lancedb.git", tag = "vX.Y.Z-beta.N" }
```

## Useful Libraries

For this tutorial, we use some common libraries to help us work with data.

**Python**

```python
import lancedb
import pandas as pd
import numpy as np
import pyarrow as pa
import os
```

**TypeScript**

```typescript
import { connect, Index, Table } from "@lancedb/lancedb";
import { FixedSizeList, Field, Float32, Schema, Utf8 } from "apache-arrow";
```

## Connect to LanceDB

### LanceDB Cloud / Enterprise

Don't forget to get your Cloud API key here! The database cluster is free and serverless.

**Python**

```python
uri = "db://your-database-uri"
api_key = "your-api-key"
region = "us-east-1"
host_override = os.environ.get("LANCEDB_HOST_OVERRIDE")

db = lancedb.connect(
    uri=uri,
    api_key=api_key,
    region=region,
    host_override=host_override
)
```

**TypeScript**

```typescript
const dbUri = process.env.LANCEDB_URI || "db://your-database-uri";
const apiKey = process.env.LANCEDB_API_KEY;
const region = process.env.LANCEDB_REGION;
const hostOverride = process.env.LANCEDB_HOST_OVERRIDE;

const db = await connect(dbUri, { apiKey, region, hostOverride });
```

### LanceDB OSS

**Python**

*Sync API*

```python
uri = "data/sample-lancedb"
db = lancedb.connect(uri)
```

*Async API*

```python
uri = "data/sample-lancedb"
db = await lancedb.connect_async(uri)
```

**TypeScript**

```typescript
import * as lancedb from "@lancedb/lancedb";
import * as arrow from "apache-arrow";

const db = await lancedb.connect(databaseDir);
```

**Rust**

```rust
#[tokio::main]
async fn main() -> Result<()> {
    let uri = "data/sample-lancedb";
    let db = connect(uri).execute().await?;
    Ok(())
}
```

See examples/simple.rs for a full working example.

LanceDB will create the directory if it doesn't exist (including parent directories). If you need a reminder of the URI, you can call `db.uri()`.
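As a quick sanity check, here is a minimal Python sketch of the behavior described above. The `data/new-project` path is just an illustrative choice, and exposing the URI as a `uri` attribute on the sync connection is an assumption based on the note above:

```python
import lancedb

# Connecting to a directory that doesn't exist yet is fine:
# LanceDB creates "data/new-project" (and any parent directories) on first use.
db = lancedb.connect("data/new-project")

print(db.uri)            # assumed: the sync connection remembers the URI it was opened with
print(db.table_names())  # [] -- a brand-new database has no tables yet
```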
## Tables

### Create a Table From Data

If you have data to insert at creation time, you can create a table and insert the data into it in a single step. The schema of the data will be used as the schema of the table.

**Python**

If the table already exists, LanceDB will raise an error by default. If you want to overwrite the table, pass `mode="overwrite"` to the `create_table` method.

*Sync API*

```python
data = [
    {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
    {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
]
tbl = db.create_table("my_table", data=data)
```

You can also pass in a pandas DataFrame directly:

```python
df = pd.DataFrame(
    [
        {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
        {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
    ]
)
tbl = db.create_table("table_from_df", data=df)
```

*Async API*

```python
data = [
    {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
    {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
]
tbl = await db.create_table("my_table_async", data=data)
```

You can also pass in a pandas DataFrame directly:

```python
df = pd.DataFrame(
    [
        {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
        {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
    ]
)
tbl = await db.create_table("table_from_df_async", df)
```

**TypeScript**

```typescript
const _tbl = await db.createTable(
  "myTable",
  [
    { vector: [3.1, 4.1], item: "foo", price: 10.0 },
    { vector: [5.9, 26.5], item: "bar", price: 20.0 },
  ],
  { mode: "overwrite" },
);
```

**Rust**

```rust
let initial_data = create_some_records()?;
let tbl = db
    .create_table("my_table", initial_data)
    .execute()
    .await
    .unwrap();
```

If the table already exists, LanceDB will raise an error by default. See the `mode` option for details on how to overwrite (or open) existing tables instead.

**Providing data:** The Rust SDK currently expects data to be provided as an Arrow `RecordBatchReader`. Support for additional formats (such as serde or polars) is on the roadmap.

Under the hood, LanceDB reads in the Apache Arrow data and persists it to disk using the Lance format.

**Automatic embedding generation with the Embedding API:** When working with embedding models, you should use the LanceDB Embedding API to automatically create vector representations of the data and queries in the background. See the Embedding Guide for more detail.

### Create an Empty Table

Sometimes you may not have the data to insert into the table at creation time. In this case, you can create an empty table and specify the schema, so that you can add data to the table at a later time (as long as it conforms to the schema). This is similar to a `CREATE TABLE` statement in SQL.

**Python**

*Sync API*

```python
schema = pa.schema([pa.field("vector", pa.list_(pa.float32(), list_size=2))])
tbl = db.create_table("empty_table", schema=schema)
```

*Async API*

```python
schema = pa.schema([pa.field("vector", pa.list_(pa.float32(), list_size=2))])
tbl = await db.create_table("empty_table_async", schema=schema)
```

**You can define the schema in Pydantic:** LanceDB comes with Pydantic support, which lets you define the schema of your data using Pydantic models. This makes it easy to work with LanceDB tables and data. Learn more about all supported types in the tables guide.
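As a minimal sketch of the Pydantic note above (the model name `Item` and table name `items_from_pydantic` are hypothetical), you can define a schema with `LanceModel` and `Vector` and pass the model class as the table schema:

```python
from lancedb.pydantic import LanceModel, Vector

# Hypothetical schema: a 2-dimensional vector plus two metadata columns.
class Item(LanceModel):
    vector: Vector(2)
    item: str
    price: float

# The Pydantic model class serves as the schema of the new (empty) table.
tbl = db.create_table("items_from_pydantic", schema=Item)
tbl.add([{"vector": [3.1, 4.1], "item": "foo", "price": 10.0}])
```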
**TypeScript**

```typescript
const schema = new arrow.Schema([
  new arrow.Field("id", new arrow.Int32()),
  new arrow.Field("name", new arrow.Utf8()),
]);
const emptyTbl = await db.createEmptyTable("empty_table", schema);
```

**Rust**

```rust
let schema = Arc::new(Schema::new(vec![
    Field::new("id", DataType::Int32, false),
    Field::new("item", DataType::Utf8, true),
]));
db.create_empty_table("empty_table", schema).execute().await
```

### Open a Table

Once created, you can open a table as follows:

**Python**

*Sync API*

```python
tbl = db.open_table("my_table")
```

*Async API*

```python
tbl = await db.open_table("my_table_async")
```

**TypeScript**

```typescript
const _tbl = await db.openTable("myTable");
```

**Rust**

```rust
let table = db.open_table("my_table").execute().await.unwrap();
```

### List Tables

If you forget your table's name, you can always get a listing of all table names:

**Python**

*Sync API*

```python
print(db.table_names())
```

*Async API*

```python
print(await db.table_names())
```

**TypeScript**

```typescript
const tableNames = await db.tableNames();
```

**Rust**

```rust
println!("{:?}", db.table_names().execute().await?);
```

### Drop Table

Use the `drop_table()` method on the database to remove a table.

**Python**

*Sync API*

```python
db.drop_table("my_table")
```

*Async API*

```python
await db.drop_table("my_table_async")
```

This permanently removes the table and is not recoverable, unlike deleting rows. By default, an exception is raised if the table does not exist. To suppress this, pass `ignore_missing=True`.

**TypeScript**

```typescript
await db.dropTable("myTable");
```

**Rust**

```rust
db.drop_table("my_table").await.unwrap();
```

## Data

LanceDB supports data in several formats: pyarrow, pandas, polars, and pydantic. You can also work with regular Python lists and dictionaries, as well as JSON and CSV files.

### Add Data to a Table

By default, data is appended to the existing table (append mode), but you can also use `mode="overwrite"` to replace existing data.

Key things to remember:

- Vector columns must have consistent dimensions
- Schema must match the table's schema
- Data types must be compatible
- Null values are supported for optional fields

**Python**

*Sync API*

```python
# Option 1: Add a list of dicts to a table
data = [
    {"vector": [1.3, 1.4], "item": "fizz", "price": 100.0},
    {"vector": [9.5, 56.2], "item": "buzz", "price": 200.0},
]
tbl.add(data)

# Option 2: Add a pandas DataFrame to a table
df = pd.DataFrame(data)
tbl.add(df)
```

*Async API*

```python
# Option 1: Add a list of dicts to a table
data = [
    {"vector": [1.3, 1.4], "item": "fizz", "price": 100.0},
    {"vector": [9.5, 56.2], "item": "buzz", "price": 200.0},
]
await tbl.add(data)

# Option 2: Add a pandas DataFrame to a table
df = pd.DataFrame(data)
await tbl.add(df)
```

**TypeScript**

```typescript
const data = [
  { vector: [1.3, 1.4], item: "fizz", price: 100.0 },
  { vector: [9.5, 56.2], item: "buzz", price: 200.0 },
];
await tbl.add(data);
```

**Rust**

```rust
let new_data = create_some_records()?;
tbl.add(new_data).execute().await.unwrap();
```

### Delete Rows

Use the `delete()` method on tables to delete rows from a table. To choose which rows to delete, provide a filter that matches on the metadata columns. This can delete any number of rows that match the filter.

**Python**

*Sync API*

```python
tbl.delete('item = "fizz"')
```

*Async API*

```python
await tbl.delete('item = "fizz"')
```

**TypeScript**

```typescript
await tbl.delete('item = "fizz"');
```

**Rust**

```rust
tbl.delete("id > 24").await.unwrap();
```

The deletion predicate is a SQL expression that supports the same expressions as the `where()` clause (`only_if()` in Rust) on a search. It can be as simple or as complex as needed. To see what expressions are supported, see the SQL filters section.
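For instance, a deletion predicate can combine several conditions. The sketch below reuses the column names from the examples above; the specific threshold is arbitrary:

```python
# Delete every row whose item is in the list AND whose price exceeds the threshold.
tbl.delete("item IN ('fizz', 'buzz') AND price > 150.0")
```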
Read more:

- Python (sync): `lancedb.table.Table.delete`
- Python (async): `lancedb.table.AsyncTable.delete`
- TypeScript: `lancedb.Table.delete`
- Rust: `lancedb::Table::delete`

## Vector Search

Once you've embedded the query, you can find its nearest neighbors as follows. LanceDB uses L2 (Euclidean) distance by default, but supports other distance metrics such as cosine similarity and dot product.

**Python**

*Sync API*

```python
tbl.search([100, 100]).limit(2).to_pandas()
```

*Async API*

```python
await tbl.vector_search([100, 100]).limit(2).to_pandas()
```

This returns a pandas DataFrame with the results.

**TypeScript**

```typescript
const res = await tbl.search([100, 100]).limit(2).toArray();
```

**Rust**

```rust
use futures::TryStreamExt;

table
    .query()
    .limit(2)
    .nearest_to(&[1.0; 128])?
    .execute()
    .await?
    .try_collect::<Vec<_>>()
    .await
```

**Query:** Rust does not yet support automatic execution of embedding functions. You will need to calculate embeddings yourself. Support for this is on the roadmap and can be tracked at https://github.com/lancedb/lancedb/issues/994. Query vectors can be provided as Arrow arrays or as a `Vec`/slice of Rust floats. Support for additional formats (e.g. `polars::series::Series`) is on the roadmap.

## Build an Index

By default, LanceDB runs a brute-force scan over the dataset to find the K nearest neighbors (KNN). For larger datasets, this can be computationally expensive.

Indexing threshold: if your table has more than 50,000 vectors, you should create an ANN index to speed up search performance. The index uses IVF (Inverted File) partitioning to reduce the search space.

**Python**

*Sync API*

```python
tbl.create_index(num_sub_vectors=1)
```

*Async API*

```python
await tbl.create_index("vector")
```

**TypeScript**

```typescript
await tbl.createIndex("vector");
```

**Rust**

```rust
table.create_index(&["vector"], Index::Auto).execute().await
```

**Why is index creation manual?** LanceDB does not automatically create the ANN index, for two reasons. First, it's optimized for really fast retrieval via a disk-based index; second, data and query workloads can be very diverse, so there's no one-size-fits-all index configuration. LanceDB provides many parameters to fine-tune index size, query latency, and accuracy.

## Embedding Data

You can use the Embedding API when working with embedding models. It automatically vectorizes the data at ingestion and query time, and comes with built-in integrations for popular embedding models like OpenAI, Hugging Face, Sentence Transformers, CLIP, and more.

**Python**

*Sync API*

```python
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

db = lancedb.connect("/tmp/db")
func = get_registry().get("openai").create(name="text-embedding-ada-002")

class Words(LanceModel):
    text: str = func.SourceField()
    vector: Vector(func.ndims()) = func.VectorField()

table = db.create_table("words", schema=Words, mode="overwrite")
table.add([{"text": "hello world"}, {"text": "goodbye world"}])

query = "greetings"
actual = table.search(query).limit(1).to_pydantic(Words)[0]
print(actual.text)
```

*Async API*

Coming soon to the async API.
Tracking issue: https://github.com/lancedb/lancedb/issues/1938

**TypeScript**

```typescript
import * as lancedb from "@lancedb/lancedb";
import "@lancedb/lancedb/embedding/openai";
import { LanceSchema, getRegistry, register } from "@lancedb/lancedb/embedding";
import { EmbeddingFunction } from "@lancedb/lancedb/embedding";
import { type Float, Float32, Utf8 } from "apache-arrow";

const db = await lancedb.connect(databaseDir);
const func = getRegistry()
  .get("openai")
  ?.create({ model: "text-embedding-ada-002" }) as EmbeddingFunction;

const wordsSchema = LanceSchema({
  text: func.sourceField(new Utf8()),
  vector: func.vectorField(),
});
const tbl = await db.createEmptyTable("words", wordsSchema, {
  mode: "overwrite",
});
await tbl.add([{ text: "hello world" }, { text: "goodbye world" }]);

const query = "greetings";
const actual = (await tbl.search(query).limit(1).toArray())[0];
```

**Rust**

```rust
use std::{iter::once, sync::Arc};

use arrow_array::{Float64Array, Int32Array, RecordBatch, RecordBatchIterator, StringArray};
use arrow_schema::{DataType, Field, Schema};
use futures::StreamExt;
use lancedb::{
    arrow::IntoArrow,
    connect,
    embeddings::{openai::OpenAIEmbeddingFunction, EmbeddingDefinition, EmbeddingFunction},
    query::{ExecutableQuery, QueryBase},
    Result,
};

#[tokio::main]
async fn main() -> Result<()> {
    let tempdir = tempfile::tempdir().unwrap();
    let tempdir = tempdir.path().to_str().unwrap();
    let api_key = std::env::var("OPENAI_API_KEY").expect("OPENAI_API_KEY is not set");
    let embedding = Arc::new(OpenAIEmbeddingFunction::new_with_model(
        api_key,
        "text-embedding-3-large",
    )?);

    let db = connect(tempdir).execute().await?;
    db.embedding_registry()
        .register("openai", embedding.clone())?;

    let table = db
        .create_table("vectors", make_data())
        .add_embedding(EmbeddingDefinition::new(
            "text",
            "openai",
            Some("embeddings"),
        ))?
        .execute()
        .await?;

    let query = Arc::new(StringArray::from_iter_values(once("something warm")));
    let query_vector = embedding.compute_query_embeddings(query)?;
    let mut results = table
        .vector_search(query_vector)?
        .limit(1)
        .execute()
        .await?;

    let rb = results.next().await.unwrap()?;
    let out = rb
        .column_by_name("text")
        .unwrap()
        .as_any()
        .downcast_ref::<StringArray>()
        .unwrap();
    let text = out.iter().next().unwrap().unwrap();
    println!("Closest match: {}", text);
    Ok(())
}
```

Learn about using the existing integrations and creating custom embedding functions in the Embedding Guide.

## What's Next?

This section covered the very basics of using LanceDB. We've prepared another example to teach you about working with whole datasets.

To learn more about vector databases, you may want to read about Indexing to get familiar with the concepts. If you've already worked with other vector databases, dive into the Table Guide to learn how to work with LanceDB in more detail.