Traditional recommendation engines usually depend on metadata—titles, tags, or transcripts. While useful, those signals don’t capture the deeper meaning of what’s happening inside a video. Imagine if your system could actually understand the visuals, audio, and context of the content itself.
That’s exactly what we’ll build in this tutorial: a semantic video recommendation engine powered by TwelveLabs, LanceDB, and Geneva.
- TwelveLabs provides multimodal embeddings that encode the narrative, mood, and actions in a video, going far beyond keyword matching.
- LanceDB stores these embeddings together with metadata and supports fast vector search through a developer-friendly Python API.
- Geneva, built on LanceDB and powered by Ray, scales the entire pipeline seamlessly from a single laptop to a large distributed cluster—without changing your code.
Why this stack?
- TwelveLabs: captures narrative flow and meaning, enabling natural queries like “a surfer riding a wave at sunset” to return relevant matches even without explicit tags.
- LanceDB: a modern vector database built on Apache Arrow, with:
- A simple, intuitive Python interface.
- Embedded operation—no external services required.
- Native multimodal support for video, images, text, and vectors in the same table.
- Geneva: extends LanceDB for distributed processing. With Ray under the hood, it parallelizes embedding generation and large-scale searches.
Loading and Materializing Videos
The first step is to load your dataset into LanceDB. Here, we’re pulling in videos from HuggingFace’s HuggingFaceFV/finevideo dataset.
import pyarrow as pa
from datasets import load_dataset

def load_videos():
    # Stream the dataset so we never materialize it fully in memory
    dataset = load_dataset("HuggingFaceFV/finevideo", split="train", streaming=True)
    batch = []
    processed = 0
    for row in dataset:
        if processed >= 10:  # demo: stop after 10 videos
            break
        video_bytes = row['mp4']
        json_metadata = row['json']
        batch.append({
            "video": video_bytes,
            "caption": json_metadata.get("youtube_title", "No description"),
            "youtube_title": json_metadata.get("youtube_title", ""),
            "video_id": f"video_{processed}",
            "duration": json_metadata.get("duration_seconds", 0),
            "resolution": json_metadata.get("resolution", "")
        })
        processed += 1
    return pa.RecordBatch.from_pylist(batch)
Here we stream the dataset to save memory and only process 10 rows for the demo. Each item stores the raw video bytes plus helpful metadata, producing a PyArrow RecordBatch that keeps video and metadata together.
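Before persisting anything, it’s worth a quick look at what load_videos() returns; these are standard PyArrow RecordBatch accessors:
videos = load_videos()
print(videos.num_rows)      # 10 for this demo
print(videos.schema.names)  # ['video', 'caption', 'youtube_title', 'video_id', 'duration', 'resolution']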
Now we persist this dataset into LanceDB using Geneva:
import geneva

db = geneva.connect("/content/quickstart/")
tbl = db.create_table("videos", load_videos(), mode="overwrite")
geneva.connect() opens a local LanceDB database at the given path, create_table() writes the dataset to the videos table, and mode="overwrite" replaces any existing table with that name.
At this point, we have a LanceDB table of videos ready for embedding and search.
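Because Geneva writes standard Lance tables, the same directory can also be opened with plain lancedb (we rely on this later when searching). A quick optional check:
import lancedb

check_db = lancedb.connect("/content/quickstart/")
print(check_db.table_names())                      # ['videos']
print(check_db.open_table("videos").count_rows())  # 10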
Embedding Videos with TwelveLabs
Next, we use TwelveLabs’ Marengo model to generate embeddings from the raw video files.
import numpy as np

# `client` is an authenticated TwelveLabs client created earlier in the notebook
task = client.embed.tasks.create(
    model_name="Marengo-retrieval-2.7",
    video_file=video_file,
    video_embedding_scope=["clip", "video"]
)
status = client.embed.tasks.wait_for_done(task.id)
result = client.embed.tasks.retrieve(task.id)

# Keep only the whole-video embedding (one vector per video)
video_segments = [seg for seg in result.video_embedding.segments
                  if seg.embedding_scope == "video"]
embedding_array = np.array(video_segments[0].float_, dtype=np.float32)
Here we submit an embedding job to TwelveLabs, request both clip and whole‑video embeddings, then convert the result to a NumPy array for storage and search.
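Since we requested clip-scope embeddings as well, the response also carries per-segment vectors you could store for finer-grained retrieval. A brief sketch (segment fields such as start_offset_sec follow the TwelveLabs response schema; verify the names for your SDK version):
# Optional: clip-level vectors, one per time window within the video
clip_segments = [seg for seg in result.video_embedding.segments
                 if seg.embedding_scope == "clip"]
for seg in clip_segments:
    vec = np.array(seg.float_, dtype=np.float32)
    print(seg.start_offset_sec, seg.end_offset_sec, vec.shape)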
With Geneva, embeddings are automatically stored as another column:
tbl.add_columns({"embedding": GenVideoEmbeddings(
    twelve_labs_api_key=os.environ['TWELVE_LABS_API_KEY']
)})
tbl.backfill("embedding", concurrency=1)
add_columns() adds an embedding column powered by GenVideoEmbeddings, and backfill() computes it for all rows (increase concurrency in production).
At this stage, every video in our LanceDB table has a semantic embedding attached.
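To confirm the backfill populated every row, here is a quick optional check (this assumes the Geneva table converts to pandas like a plain LanceDB table):
df = tbl.to_pandas()
print(df["embedding"].notna().sum(), "of", len(df), "videos embedded")
print(len(df["embedding"].iloc[0]))  # vector length (1024 for Marengo-retrieval-2.7)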
Searching with LanceDB
With embeddings stored, LanceDB can run vector search queries.
query = "educational tutorial"
query_result = client.embed.create(
model_name="Marengo-retrieval-2.7",
text=query
)
qvec = np.array(query_result.text_embedding.segments[0].float_)
lance_db = lancedb.connect("/content/quickstart/")
lance_tbl = lance_db.open_table("videos")
results = (lance_tbl
.search(qvec)
.metric("cosine")
.limit(3)
.to_pandas())
Here we embed the query into qvec, open the videos table, run .search(qvec) with cosine similarity, and return the top matches as a pandas DataFrame. This is semantic search in action.
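To present the matches, you can read titles plus the _distance column that LanceDB appends to search results (a small formatting sketch):
for _, row in results.iterrows():
    print(f"{row['youtube_title']}  (cosine distance: {row['_distance']:.4f})")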
Summarizing with Pegasus
Embeddings alone provide similarity, but they don’t explain why a result was returned. For better UX, TwelveLabs also provides Pegasus, a summarization model:
import time

index = client.indexes.create(
    index_name=f"lancedb_demo_{int(time.time())}",
    models=[{"model_name": "pegasus1.2", "model_options": ["visual", "audio"]}]
)
pegasus1.2 creates short multimodal summaries you can store alongside results to make recommendations easier to understand.
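To actually produce a summary, you upload a video into the Pegasus index and call the summarization endpoint. A minimal sketch, assuming the TwelveLabs SDK’s tasks.create and summarize methods (verify the exact names and fields against your SDK version):
# Upload one video into the Pegasus index (assumed SDK surface)
task = client.tasks.create(index_id=index.id, video_file=video_file)
task = client.tasks.wait_for_done(task.id)  # the completed task carries the video_id

# Generate a short summary to store alongside the embedding
summary = client.summarize(video_id=task.video_id, type="summary")
print(summary.summary)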
Scaling with Geneva and Ray
Small datasets can be managed manually, but at enterprise scale you need automation. Geneva + Ray handle this:
| Concern | LanceDB only | With Geneva and Ray |
|---|---|---|
| Ingestion | Manual loaders | Declarative pipelines |
| Embeddings | Sequential | Parallel across many workers |
| Storage | Local tables | Distributed LanceDB tables |
| ML/Analytics | Custom scripts | Built-in distributed UDFs |
Here we declare the pipeline once and run it anywhere. Ray parallelizes the work so you can scale from a laptop to a large cluster without changing your code.
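Custom columns follow the same declarative pattern as the embedding column above: decorate a function with Geneva’s udf, register it with add_columns(), and backfill() fans the computation out over Ray workers. A small sketch (title_length is a hypothetical illustration; Geneva matches input columns by parameter name):
from geneva import udf

@udf
def title_length(youtube_title: str) -> int:  # hypothetical derived column
    return len(youtube_title)

tbl.add_columns({"title_length": title_length})
tbl.backfill("title_length", concurrency=8)  # raise concurrency to use more workers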
Try it out
By combining TwelveLabs, LanceDB, and Geneva you can build a recommendation system that understands video content directly.
- TwelveLabs Playground – Sign up for an API key and start generating video embeddings right away
- LanceDB Quickstart – Install LanceDB locally and try your first vector search with Python
- Geneva Documentation – Learn how to scale pipelines and run distributed embedding jobs with Ray
- Complete Notebook – Explore the full runnable code with all the details