# Working With Tables in LanceDB

In LanceDB, tables store records with a defined schema that specifies column names and types. You can create LanceDB tables from these data formats:

- Pandas DataFrames
- Polars DataFrames
- Apache Arrow Tables

The Python SDK additionally supports:

- PyArrow schemas for explicit schema control
- `LanceModel` for Pydantic-based validation

## Create a LanceDB Table

Initialize a LanceDB connection and create a table.

**Python (sync API)**

```python
import lancedb

uri = "data/sample-lancedb"
db = lancedb.connect(uri)
```

**Python (async API)**

```python
import lancedb

uri = "data/sample-lancedb"
async_db = await lancedb.connect_async(uri)
```

**TypeScript (@lancedb/lancedb)**

```typescript
import * as lancedb from "@lancedb/lancedb";
import * as arrow from "apache-arrow";

const uri = "data/sample-lancedb";
const db = await lancedb.connect(uri);
```

**TypeScript (vectordb, deprecated)**

```javascript
const lancedb = require("vectordb");
const arrow = require("apache-arrow");

const uri = "data/sample-lancedb";
const db = await lancedb.connect(uri);
```

LanceDB can ingest data from various sources: `dict`, `list[dict]`, `pd.DataFrame`, `pa.Table`, or an `Iterator[pa.RecordBatch]`. Let's take a look at some of these.

### From a list of tuples or dictionaries

**Python (sync API)**

```python
data = [
    {"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
    {"vector": [0.2, 1.8], "lat": 40.1, "long": -74.1},
]
db.create_table("test_table", data)
db["test_table"].head()
```

**Python (async API)**

```python
data = [
    {"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
    {"vector": [0.2, 1.8], "lat": 40.1, "long": -74.1},
]
async_tbl = await async_db.create_table("test_table_async", data)
await async_tbl.head()
```

> **Note:** If the table already exists, LanceDB raises an error by default. `create_table` supports an optional `exist_ok` parameter. When it is set to `True` and the table exists, `create_table` simply opens the existing table; the data you passed in is NOT appended to the table in that case.
**Python (sync API)**

```python
db.create_table("test_table", data, exist_ok=True)
```

**Python (async API)**

```python
await async_db.create_table("test_table_async", data, exist_ok=True)
```

Sometimes you want to make sure that you start fresh. If you want to overwrite the table, pass `mode="overwrite"` to `create_table`.

**Python (sync API)**

```python
db.create_table("test_table", data, mode="overwrite")
```

**Python (async API)**

```python
await async_db.create_table("test_table_async", data, mode="overwrite")
```

You can create a LanceDB table in JavaScript using an array of records as follows.

**TypeScript (@lancedb/lancedb)**

```typescript
const _tbl = await db.createTable(
  "myTable",
  [
    { vector: [3.1, 4.1], item: "foo", price: 10.0 },
    { vector: [5.9, 26.5], item: "bar", price: 20.0 },
  ],
  { mode: "overwrite" },
);
```

This infers the schema from the provided data. If you want to provide a schema explicitly, you can declare one with apache-arrow:

```typescript
const schema = new arrow.Schema([
  new arrow.Field(
    "vector",
    new arrow.FixedSizeList(
      2,
      new arrow.Field("item", new arrow.Float32(), true),
    ),
  ),
  new arrow.Field("item", new arrow.Utf8(), true),
  new arrow.Field("price", new arrow.Float32(), true),
]);
const data = [
  { vector: [3.1, 4.1], item: "foo", price: 10.0 },
  { vector: [5.9, 26.5], item: "bar", price: 20.0 },
];
const tbl = await db.createTable("myTable", data, {
  schema,
});
```

> **Note:** `createTable` supports an optional `existOk` parameter. When it is set to `true` and the table exists, `createTable` simply opens the existing table; the data you passed in is NOT appended to the table in that case.

```typescript
const tbl = await db.createTable("myTable", data, {
  existOk: true,
});
```

Sometimes you want to make sure that you start fresh. If you want to overwrite the table, pass `mode: "overwrite"` to `createTable`.
```typescript
const tbl = await db.createTable("myTable", data, {
  mode: "overwrite",
});
```

**TypeScript (vectordb, deprecated)**

```javascript
const tbl = await db.createTable(
  "myTable",
  [
    { vector: [3.1, 4.1], item: "foo", price: 10.0 },
    { vector: [5.9, 26.5], item: "bar", price: 20.0 },
  ],
  { writeMode: lancedb.WriteMode.Overwrite },
);
```

This infers the schema from the provided data. If you want to provide a schema explicitly, you can declare one with apache-arrow:

```javascript
const schema = new arrow.Schema([
  new arrow.Field(
    "vector",
    new arrow.FixedSizeList(
      2,
      new arrow.Field("item", new arrow.Float32(), true),
    ),
  ),
  new arrow.Field("item", new arrow.Utf8(), true),
  new arrow.Field("price", new arrow.Float32(), true),
]);
const data = [
  { vector: [3.1, 4.1], item: "foo", price: 10.0 },
  { vector: [5.9, 26.5], item: "bar", price: 20.0 },
];
const tbl = await db.createTable({
  name: "myTableWithSchema",
  data,
  schema,
});
```

> **Warning:** `existOk` is not available in vectordb. If the table already exists, vectordb raises an error by default. You can use `writeMode: WriteMode.Overwrite` to overwrite the table, but this deletes the existing table and creates a new one with the same name.

Sometimes you want to make sure that you start fresh. If you want to overwrite the table, pass `writeMode: lancedb.WriteMode.Overwrite` to `createTable`.

```javascript
const table = await con.createTable(tableName, data, {
  writeMode: WriteMode.Overwrite,
});
```

### From a Pandas DataFrame

**Python (sync API)**

```python
import pandas as pd

data = pd.DataFrame(
    {
        "vector": [[1.1, 1.2, 1.3, 1.4], [0.2, 1.8, 0.4, 3.6]],
        "lat": [45.5, 40.1],
        "long": [-122.7, -74.1],
    }
)
db.create_table("my_table_pandas", data)
db["my_table_pandas"].head()
```

**Python (async API)**

```python
import pandas as pd

data = pd.DataFrame(
    {
        "vector": [[1.1, 1.2, 1.3, 1.4], [0.2, 1.8, 0.4, 3.6]],
        "lat": [45.5, 40.1],
        "long": [-122.7, -74.1],
    }
)
async_tbl = await async_db.create_table("my_table_async_pd", data)
await async_tbl.head()
```

> **Note:** Data is converted to Arrow before being written to disk.
For maximum control over how data is saved, either provide a PyArrow schema for the conversion or provide a PyArrow Table directly. The vector column needs to be a vector type (defined as a `pyarrow.FixedSizeList`).

**Python (sync API)**

```python
import pyarrow as pa

custom_schema = pa.schema(
    [
        pa.field("vector", pa.list_(pa.float32(), 4)),
        pa.field("lat", pa.float32()),
        pa.field("long", pa.float32()),
    ]
)
tbl = db.create_table("my_table_custom_schema", data, schema=custom_schema)
```

**Python (async API)**

```python
import pyarrow as pa

custom_schema = pa.schema(
    [
        pa.field("vector", pa.list_(pa.float32(), 4)),
        pa.field("lat", pa.float32()),
        pa.field("long", pa.float32()),
    ]
)
async_tbl = await async_db.create_table(
    "my_table_async_custom_schema", data, schema=custom_schema
)
```

### From a Polars DataFrame

LanceDB supports Polars, a modern, fast DataFrame library written in Rust. Just like the Pandas integration, the Polars integration is enabled by PyArrow under the hood. A deeper integration between LanceDB tables and Polars DataFrames is on the way.

**Python (sync API)**

```python
import polars as pl

data = pl.DataFrame(
    {
        "vector": [[3.1, 4.1], [5.9, 26.5]],
        "item": ["foo", "bar"],
        "price": [10.0, 20.0],
    }
)
tbl = db.create_table("my_table_pl", data)
```

**Python (async API)**

```python
import polars as pl

data = pl.DataFrame(
    {
        "vector": [[3.1, 4.1], [5.9, 26.5]],
        "item": ["foo", "bar"],
        "price": [10.0, 20.0],
    }
)
async_tbl = await async_db.create_table("my_table_async_pl", data)
```

### From an Arrow Table

You can also create LanceDB tables directly from Arrow tables. LanceDB supports the `float16` data type!
**Python (sync API)**

```python
import numpy as np
import pyarrow as pa

dim = 16
total = 2
schema = pa.schema(
    [pa.field("vector", pa.list_(pa.float16(), dim)), pa.field("text", pa.string())]
)
data = pa.Table.from_arrays(
    [
        pa.array(
            [np.random.randn(dim).astype(np.float16) for _ in range(total)],
            pa.list_(pa.float16(), dim),
        ),
        pa.array(["foo", "bar"]),
    ],
    ["vector", "text"],
)
tbl = db.create_table("f16_tbl", data, schema=schema)
```

**Python (async API)**

```python
import numpy as np
import pyarrow as pa

dim = 16
total = 2
schema = pa.schema(
    [pa.field("vector", pa.list_(pa.float16(), dim)), pa.field("text", pa.string())]
)
data = pa.Table.from_arrays(
    [
        pa.array(
            [np.random.randn(dim).astype(np.float16) for _ in range(total)],
            pa.list_(pa.float16(), dim),
        ),
        pa.array(["foo", "bar"]),
    ],
    ["vector", "text"],
)
async_tbl = await async_db.create_table("f16_tbl_async", data, schema=schema)
```

**TypeScript (@lancedb/lancedb)**

```typescript
const db = await lancedb.connect(databaseDir);

const dim = 16;
const total = 10;
const f16Schema = new Schema([
  new Field("id", new Int32()),
  new Field(
    "vector",
    new FixedSizeList(dim, new Field("item", new Float16(), true)),
    false,
  ),
]);
const data = lancedb.makeArrowTable(
  Array.from(Array(total), (_, i) => ({
    id: i,
    vector: Array.from(Array(dim), Math.random),
  })),
  { schema: f16Schema },
);
const _table = await db.createTable("f16_tbl", data);
```

**TypeScript (vectordb, deprecated)**

```javascript
const dim = 16;
const total = 10;
const schema = new Schema([
  new Field("id", new Int32()),
  new Field(
    "vector",
    new FixedSizeList(dim, new Field("item", new Float16(), true)),
    false,
  ),
]);
const data = lancedb.makeArrowTable(
  Array.from(Array(total), (_, i) => ({
    id: i,
    vector: Array.from(Array(dim), Math.random),
  })),
  { schema },
);
const table = await db.createTable("f16_tbl", data);
```

### From Pydantic Models

When you create an empty table without data, you must specify the table schema. LanceDB supports creating tables from a PyArrow schema or from a specialized Pydantic model called `LanceModel`.
For example, the following `Content` model specifies a table with five columns: `movie_id`, `vector`, `genres`, `title`, and `imdb_id`. When you create a table, you can pass the class as the value of the `schema` parameter to `create_table`. The `vector` column is a `Vector` type, a specialized Pydantic type that can be configured with the vector dimensions. Note that LanceDB only understands subclasses of `lancedb.pydantic.LanceModel` (which itself derives from `pydantic.BaseModel`).

**Python (sync API)**

```python
from lancedb.pydantic import LanceModel, Vector


class Content(LanceModel):
    movie_id: int
    vector: Vector(128)
    genres: str
    title: str
    imdb_id: int

    @property
    def imdb_url(self) -> str:
        return f"https://www.imdb.com/title/tt{self.imdb_id}"


tbl = db.create_table("movielens_small", schema=Content)
```

**Python (async API)**

```python
from lancedb.pydantic import LanceModel, Vector


class Content(LanceModel):
    movie_id: int
    vector: Vector(128)
    genres: str
    title: str
    imdb_id: int

    @property
    def imdb_url(self) -> str:
        return f"https://www.imdb.com/title/tt{self.imdb_id}"


async_tbl = await async_db.create_table("movielens_small_async", schema=Content)
```

#### Nested schemas

Sometimes your data model may contain nested objects.
For example, you may want to store the document string and the document source name as a nested `Document` object:

```python
from pydantic import BaseModel


class Document(BaseModel):
    content: str
    source: str
```

This can be used as the type of a LanceDB table column:

**Python (sync API)**

```python
class NestedSchema(LanceModel):
    id: str
    vector: Vector(1536)
    document: Document


tbl = db.create_table("nested_table", schema=NestedSchema)
```

**Python (async API)**

```python
class NestedSchema(LanceModel):
    id: str
    vector: Vector(1536)
    document: Document


async_tbl = await async_db.create_table("nested_table_async", schema=NestedSchema)
```

This creates a struct column called `document` that has two subfields called `content` and `source`:

```
In [28]: tbl.schema
Out[28]:
id: string not null
vector: fixed_size_list<item: float>[1536] not null
    child 0, item: float
document: struct<content: string not null, source: string not null> not null
    child 0, content: string not null
    child 1, source: string not null
```

#### Validators

Note that neither Pydantic nor PyArrow automatically validates that input data has the correct timezone, but this is easy to add as a custom field validator:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

from lancedb.pydantic import LanceModel
from pydantic import Field, ValidationError, field_validator

tzname = "America/New_York"
tz = ZoneInfo(tzname)


class TestModel(LanceModel):
    dt_with_tz: datetime = Field(json_schema_extra={"tz": tzname})

    @field_validator("dt_with_tz")
    @classmethod
    def tz_must_match(cls, dt: datetime) -> datetime:
        assert dt.tzinfo == tz
        return dt


ok = TestModel(dt_with_tz=datetime.now(tz))

try:
    TestModel(dt_with_tz=datetime.now(ZoneInfo("Asia/Shanghai")))
    raise AssertionError("this should raise ValidationError")
except ValidationError:
    print("A ValidationError was raised.")
```

When you run this code, it should print "A ValidationError was raised."

#### Pydantic custom types

LanceDB does NOT yet support converting Pydantic custom types.
If this is something you need, please file a feature request on the LanceDB GitHub repo.

## Using Iterators / Writing Large Datasets

When creating a table from a large dataset in one go, it is recommended to feed the data in batches via an iterator. Unlike manually adding batches with `table.add()`, this does not create multiple versions of your dataset. LanceDB supports PyArrow `RecordBatch` iterators as well as other generators producing supported data types. Here's an example using a `RecordBatch` iterator for creating tables.

**Python (sync API)**

```python
import pyarrow as pa


def make_batches():
    for i in range(5):
        yield pa.RecordBatch.from_arrays(
            [
                pa.array(
                    [[3.1, 4.1, 5.1, 6.1], [5.9, 26.5, 4.7, 32.8]],
                    pa.list_(pa.float32(), 4),
                ),
                pa.array(["foo", "bar"]),
                pa.array([10.0, 20.0]),
            ],
            ["vector", "item", "price"],
        )


schema = pa.schema(
    [
        pa.field("vector", pa.list_(pa.float32(), 4)),
        pa.field("item", pa.utf8()),
        pa.field("price", pa.float32()),
    ]
)
db.create_table("batched_table", make_batches(), schema=schema)
```

**Python (async API)**

```python
import pyarrow as pa


def make_batches():
    for i in range(5):
        yield pa.RecordBatch.from_arrays(
            [
                pa.array(
                    [[3.1, 4.1, 5.1, 6.1], [5.9, 26.5, 4.7, 32.8]],
                    pa.list_(pa.float32(), 4),
                ),
                pa.array(["foo", "bar"]),
                pa.array([10.0, 20.0]),
            ],
            ["vector", "item", "price"],
        )


schema = pa.schema(
    [
        pa.field("vector", pa.list_(pa.float32(), 4)),
        pa.field("item", pa.utf8()),
        pa.field("price", pa.float32()),
    ]
)
await async_db.create_table("batched_table", make_batches(), schema=schema)
```

You can also use iterators of other types, such as Pandas DataFrames or Python lists, in the above example.

## Open existing tables

**Python**

If you forget the name of your table, you can always get a listing of all table names.

**Python (sync API)**

```python
print(db.table_names())
```

**Python (async API)**

```python
print(await async_db.table_names())
```

Then, you can open any existing table.
**Python (sync API)**

```python
tbl = db.open_table("test_table")
```

**Python (async API)**

```python
async_tbl = await async_db.open_table("test_table_async")
```

**TypeScript**

If you forget the name of your table, you can always get a listing of all table names.

```typescript
console.log(await db.tableNames());
```

Then, you can open any existing table.

```typescript
const tbl = await db.openTable("my_table");
```

## Creating an empty table

You can create an empty table for scenarios where you want to add data to the table later. An example would be when you want to collect data from a stream or an external file and then add it to the table in batches.

**Python**

An empty table can be initialized via a PyArrow schema.

**Python (sync API)**

```python
import lancedb
import pyarrow as pa

schema = pa.schema(
    [
        pa.field("vector", pa.list_(pa.float32(), 2)),
        pa.field("item", pa.string()),
        pa.field("price", pa.float32()),
    ]
)
tbl = db.create_table("test_empty_table", schema=schema)
```

**Python (async API)**

```python
import lancedb
import pyarrow as pa

schema = pa.schema(
    [
        pa.field("vector", pa.list_(pa.float32(), 2)),
        pa.field("item", pa.string()),
        pa.field("price", pa.float32()),
    ]
)
async_tbl = await async_db.create_table("test_empty_table_async", schema=schema)
```

Alternatively, you can use Pydantic to specify the schema for the empty table. Note that you do not import `pydantic` directly; instead you use `LanceModel` from `lancedb.pydantic`, a subclass of `pydantic.BaseModel` that has been extended to support LanceDB-specific types like `Vector`.

**Python (sync API)**

```python
import lancedb
from lancedb.pydantic import LanceModel, Vector


class Item(LanceModel):
    vector: Vector(2)
    item: str
    price: float


tbl = db.create_table("test_empty_table_new", schema=Item.to_arrow_schema())
```

**Python (async API)**

```python
import lancedb
from lancedb.pydantic import LanceModel, Vector


class Item(LanceModel):
    vector: Vector(2)
    item: str
    price: float


async_tbl = await async_db.create_table(
    "test_empty_table_async_new", schema=Item.to_arrow_schema()
)
```

Once the empty table has been created, you can add data to it via the various methods listed in the Adding to a table section.
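As a minimal sketch of that later ingestion step, assuming the `Item` schema above (two-dimensional vectors plus `item` and `price` fields), rows destined for the table must match its fields. The `add` calls are shown as comments because they need a live table handle:

```python
# Rows matching the hypothetical Item schema above:
# vector: 2-d float vector, item: str, price: float.
rows = [
    {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
    {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
]

# Lightweight sanity check before ingesting: every vector must have
# the dimensionality declared in the schema (2 here).
assert all(len(r["vector"]) == 2 for r in rows)

# With the table handles from the examples above, the batch would be
# appended with:
# tbl.add(rows)              # sync API
# await async_tbl.add(rows)  # async API
```

Batching rows like this before calling `add` keeps the number of table versions small, since each `add` call creates a new version.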
**TypeScript (@lancedb/lancedb)**

```typescript
const schema = new arrow.Schema([
  new arrow.Field("id", new arrow.Int32()),
  new arrow.Field("name", new arrow.Utf8()),
]);
const emptyTbl = await db.createEmptyTable("empty_table", schema);
```

**TypeScript (vectordb, deprecated)**

```javascript
const schema = new arrow.Schema([
  new arrow.Field("id", new arrow.Int32()),
  new arrow.Field("name", new arrow.Utf8()),
]);
const empty_tbl = await db.createTable({ name: "empty_table", schema });
```

## Drop a table

Use the `drop_table()` method on the database to remove a table.

**Python (sync API)**

```python
db.drop_table("my_table")
```

**Python (async API)**

```python
await async_db.drop_table("my_table_async")
```

This permanently removes the table and is not recoverable, unlike deleting rows. By default, an exception is raised if the table does not exist. To suppress this, pass `ignore_missing=True`.

**TypeScript**

```typescript
await db.dropTable("myTable");
```

This permanently removes the table and is not recoverable, unlike deleting rows. An exception is raised if the table does not exist.