LangChain 🔗 LangChain is a framework designed for building applications with large language models (LLMs) by chaining together various components. It supports a range of functionalities including memory, agents, and chat models, enabling developers to create context-aware applications. LangChain streamlines these stages (in figure above) by providing pre-built components and tools for integration, memory management, and deployment, allowing developers to focus on application logic rather than underlying complexities. Integration of Langchain with LanceDB enables applications to retrieve the most relevant data by comparing query vectors against stored vectors, facilitating effective information retrieval. It results in better and context aware replies and actions by the LLMs. Quick Start You can load your document data using langchain's loaders, for this example we are using TextLoader and OpenAIEmbeddings as the embedding model. Checkout Complete example here - LangChain demo import os from langchain.document_loaders import TextLoader from langchain.vectorstores import LanceDB from langchain_openai import OpenAIEmbeddings from langchain_text_splitters import CharacterTextSplitter os.environ["OPENAI_API_KEY"] = "sk-..." loader = TextLoader("../../modules/state_of_the_union.txt") # Replace with your data path documents = loader.load() documents = CharacterTextSplitter().split_documents(documents) embeddings = OpenAIEmbeddings() docsearch = LanceDB.from_documents(documents, embeddings) query = "What did the president say about Ketanji Brown Jackson" docs = docsearch.similarity_search(query) print(docs[0].page_content) Documentation In the above example LanceDB vector store class object is created using from_documents() method which is a classmethod and returns the initialized class object. You can also use LanceDB.from_texts(texts: List[str],embedding: Embeddings) class method. The exhaustive list of parameters for LanceDB vector store are : Name type Purpose default connection (Optional) Any lancedb.db.LanceDBConnection connection object to use. If not provided, a new connection will be created. None embedding (Optional) Embeddings Langchain embedding model. Provided by user. uri (Optional) str It specifies the directory location of LanceDB database and establishes a connection that can be used to interact with the database. /tmp/lancedb vector_key (Optional) str Column name to use for vector's in the table. 'vector' id_key (Optional) str Column name to use for id's in the table. 'id' text_key (Optional) str Column name to use for text in the table. 'text' table_name (Optional) str Name of your table in the database. 'vectorstore' api_key (Optional str) API key to use for LanceDB cloud database. None region (Optional) str Region to use for LanceDB cloud database. Only for LanceDB Cloud : None. mode (Optional) str Mode to use for adding data to the table. Valid values are "append" and "overwrite". 'overwrite' table (Optional) Any You can connect to an existing table of LanceDB, created outside of langchain, and utilize it. None distance (Optional) str The choice of distance metric used to calculate the similarity between vectors. 'l2' reranker (Optional) Any The reranker to use for LanceDB. None relevance_score_fn (Optional) Callable[[float], float] Langchain relevance score function to be used. None limit int Set the maximum number of results to return. DEFAULT_K (it is 4) db_url = "db://lang_test" # url of db you created api_key = "xxxxx" # your API key region="us-east-1-dev" # your selected region vector_store = LanceDB( uri=db_url, api_key=api_key, #(dont include for local API) region=region, #(dont include for local API) embedding=embeddings, table_name='langchain_test' # Optional ) Methods add_texts() This method turn texts into embedding and add it to the database. Name Purpose defaults texts Iterable of strings to add to the vectorstore. Provided by user metadatas Optional list[dict()] of metadatas associated with the texts. None ids Optional list of ids to associate with the texts. None kwargs Other keyworded arguments provided by the user. - It returns list of ids of the added texts. vector_store.add_texts(texts = ['test_123'], metadatas =[{'source' :'wiki'}]) #Additionaly, to explore the table you can load it into a df or save it in a csv file: tbl = vector_store.get_table() print("tbl:", tbl) pd_df = tbl.to_pandas() pd_df.to_csv("docsearch.csv", index=False) # you can also create a new vector store object using an older connection object: vector_store = LanceDB(connection=tbl, embedding=embeddings) create_index() This method creates a scalar(for non-vector cols) or a vector index on a table. Name type Purpose defaults vector_col Optional[str] Provide if you want to create index on a vector column. None col_name Optional[str] Provide if you want to create index on a non-vector column. None metric Optional[str] Provide the metric to use for vector index. choice of metrics: 'l2', 'dot', 'cosine'. l2 num_partitions Optional[int] Number of partitions to use for the index. 256 num_sub_vectors Optional[int] Number of sub-vectors to use for the index. 96 index_cache_size Optional[int] Size of the index cache. None name Optional[str] Name of the table to create index on. None For index creation make sure your table has enough data in it. An ANN index is ususally not needed for datasets ~100K vectors. For large-scale (>1M) or higher dimension vectors, it is beneficial to create an ANN index. # for creating vector index vector_store.create_index(vector_col='vector', metric = 'cosine') # for creating scalar index(for non-vector columns) vector_store.create_index(col_name='text') similarity_search() This method performs similarity search based on text query. Name Type Purpose Default query str A str representing the text query that you want to search for in the vector store. N/A k Optional[int] It specifies the number of documents to return. None filter Optional[Dict[str, str]] It is used to filter the search results by specific metadata criteria. None fts Optional[bool] It indicates whether to perform a full-text search (FTS). False name Optional[str] It is used for specifying the name of the table to query. If not provided, it uses the default table set during the initialization of the LanceDB instance. None kwargs Any Other keyworded arguments provided by the user. N/A Return documents most similar to the query without relevance scores. docs = docsearch.similarity_search(query) print(docs[0].page_content) similarity_search_by_vector() The method returns documents that are most similar to the specified embedding (query) vector. Name Type Purpose Default embedding List[float] The embedding vector you want to use to search for similar documents in the vector store. N/A k Optional[int] It specifies the number of documents to return. None filter Optional[Dict[str, str]] It is used to filter the search results by specific metadata criteria. None name Optional[str] It is used for specifying the name of the table to query. If not provided, it uses the default table set during the initialization of the LanceDB instance. None kwargs Any Other keyworded arguments provided by the user. N/A It does not provide relevance scores. docs = docsearch.similarity_search_by_vector(query) print(docs[0].page_content) similarity_search_with_score() Returns documents most similar to the query string along with their relevance scores. Name Type Purpose Default query str A str representing the text query you want to search for in the vector store. This query will be converted into an embedding using the specified embedding function. N/A k Optional[int] It specifies the number of documents to return. None filter Optional[Dict[str, str]] It is used to filter the search results by specific metadata criteria. This allows you to narrow down the search results based on certain metadata attributes associated with the documents. None kwargs Any Other keyworded arguments provided by the user. N/A It gets called by base class's similarity_search_with_relevance_scores which selects relevance score based on our _select_relevance_score_fn. docs = docsearch.similarity_search_with_relevance_scores(query) print("relevance score - ", docs[0][1]) print("text- ", docs[0][0].page_content[:1000]) similarity_search_by_vector_with_relevance_scores() Similarity search using query vector. Name Type Purpose Default embedding List[float] The embedding vector you want to use to search for similar documents in the vector store. N/A k Optional[int] It specifies the number of documents to return. None filter Optional[Dict[str, str]] It is used to filter the search results by specific metadata criteria. None name Optional[str] It is used for specifying the name of the table to query. None kwargs Any Other keyworded arguments provided by the user. N/A The method returns documents most similar to the specified embedding (query) vector, along with their relevance scores. docs = docsearch.similarity_search_by_vector_with_relevance_scores(query_embedding) print("relevance score - ", docs[0][1]) print("text- ", docs[0][0].page_content[:1000]) max_marginal_relevance_search() This method returns docs selected using the maximal marginal relevance(MMR). Maximal marginal relevance optimizes for similarity to query AND diversity among selected documents. Name Type Purpose Default query str Text to look up documents similar to. N/A k Optional[int] Number of Documents to return. 4 fetch_k Optional[int] Number of Documents to fetch to pass to MMR algorithm. None lambda_mult float Number between 0 and 1 that determines the degree of diversity among the results with 0 corresponding to maximum diversity and 1 to minimum diversity. 0.5 filter Optional[Dict[str, str]] Filter by metadata. None kwargs Other keyworded arguments provided by the user. - Similarly, max_marginal_relevance_search_by_vector() function returns docs most similar to the embedding passed to the function using MMR. instead of a string query you need to pass the embedding to be searched for. result = docsearch.max_marginal_relevance_search( query="text" ) result_texts = [doc.page_content for doc in result] print(result_texts) ## search by vector : result = docsearch.max_marginal_relevance_search_by_vector( embeddings.embed_query("text") ) result_texts = [doc.page_content for doc in result] print(result_texts) add_images() This method ddds images by automatically creating their embeddings and adds them to the vectorstore. Name Type Purpose Default uris List[str] File path to the image N/A metadatas Optional[List[dict]] Optional list of metadatas None ids Optional[List[str]] Optional list of IDs None It returns list of IDs of the added images. vec_store.add_images(uris=image_uris) # here image_uris are local fs paths to the images.