Author
James Le, Manish Maheshwari
Date Published
August 14, 2024
Tags
API Tutorial
Embed API
Embeddings
Database
Developers
Open Source
Vector Database
Video understanding
Semantic Search
Partnership
TLDR: Learn how to integrate Twelve Labs' Embed API with LanceDB to build advanced AI applications for video understanding and retrieval. This guide covers everything from setting up your environment and generating video embeddings to storing and querying them efficiently. Big thanks to the LanceDB team (Jasmine Wang and Ayush Chaurasia) for collaborating with us on this tutorial.


Introduction

Welcome to this tutorial on integrating Twelve Labs' Embed API with LanceDB for advanced video understanding and retrieval. This guide demonstrates how to leverage these two powerful technologies to create AI applications that can process and analyze video content with unprecedented accuracy and efficiency.

Twelve Labs' Embed API generates rich, multimodal embeddings that capture video content's essence, while LanceDB is a serverless vector database that excels in storing, indexing, and querying these high-dimensional vectors at scale. Combining these technologies enables the creation of applications like semantic video search engines, content-based recommendation systems, and anomaly detection in video streams.


Setup and Installation

Before we dive into the core functionality, let's set up our environment and install the necessary libraries.


Installing Required Libraries

We'll start by installing the Twelve Labs SDK and LanceDB. Run the following command in your notebook cell:

!pip install twelvelabs lancedb

This command installs both libraries, giving us access to Twelve Labs' video processing capabilities and LanceDB's vector storage and querying functionalities.


Authenticating with Twelve Labs

Next, we need to set up authentication for the Twelve Labs API. We'll use environment variables to securely store the API key:

from twelvelabs import TwelveLabs
import os

# Retrieve the API key from environment variables
TWELVE_LABS_API_KEY = os.getenv('TWELVE_LABS_API_KEY')

# Initialize the Twelve Labs client
twelvelabs_client = TwelveLabs(api_key=TWELVE_LABS_API_KEY)

Note: Ensure you have set the TWELVE_LABS_API_KEY environment variable with your actual API key before running this code. For example, in your terminal:

export TWELVE_LABS_API_KEY='your_api_key_here'

Initializing LanceDB

Now, let's initialize our connection to LanceDB:

import lancedb
import pyarrow as pa

# Create or connect to a LanceDB database
db = lancedb.connect("my_video_embeddings_db")

# Define the schema for your video embeddings
schema = pa.schema(
    [
        pa.field("video_id", pa.string()),
        pa.field("embedding", pa.list_(pa.float32(), 1024)),
        pa.field("start_time", pa.float32()),
        pa.field("end_time", pa.float32()),
        pa.field("video_url", pa.string())
    ]
)

# Create a table for storing video embeddings
table = db.create_table("video_embeddings", schema=schema, mode="overwrite")

This code creates a new LanceDB database (or connects to an existing one) and sets up a table specifically for storing our video embeddings.

  • We import the pyarrow library and use pa.schema to define the schema in the format expected by the lancedb library. We can also define the schema using Pydantic models (see the sketch after this list).
  • We define the schema for our table, which includes:
    • video_id: A unique identifier for each embedding
    • embedding: The vector representation of the video segment (1024 is the dimension of video embeddings provided by Twelve Labs Embed API)
    • start_time and end_time: The temporal information for the video segment
    • video_url: The source of the video
  • The create_table method is called with the table name, the defined schema, and the mode set to "overwrite" to recreate the table if it already exists.
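
As noted above, the same schema can also be expressed with LanceDB's Pydantic integration. Below is a minimal sketch under that assumption; the VideoEmbedding class name is ours, while the field names and 1024-dimension vector mirror the PyArrow schema defined earlier:

from lancedb.pydantic import LanceModel, Vector

# Pydantic model mirroring the PyArrow schema above
class VideoEmbedding(LanceModel):
    video_id: str
    embedding: Vector(1024)  # 1024-dimensional vectors from the Embed API
    start_time: float
    end_time: float
    video_url: str

# Equivalent table creation driven by the Pydantic model
# table = db.create_table("video_embeddings", schema=VideoEmbedding, mode="overwrite")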


Generating Video Embeddings with Twelve Labs Embed API

In this section, we'll use the Twelve Labs Embed API to generate multimodal embeddings from video content. These embeddings capture the rich context of videos, including visual elements, audio, and temporal information.

First, let's define a function to generate embeddings for a given video URL:

from typing import Any, Dict, List, Tuple

from twelvelabs.models.embed import EmbeddingsTask

def generate_embedding(video_url: str) -> Tuple[List[Dict[str, Any]], Any]:
    """Generate embeddings for a given video URL."""
    task = twelvelabs_client.embed.task.create(
        engine_name="Marengo-retrieval-2.6",
        video_url=video_url
    )
    
    def on_task_update(task: EmbeddingsTask):
        print(f"  Status={task.status}")

    task.wait_for_done(sleep_interval=2, callback=on_task_update)
    task_result = twelvelabs_client.embed.task.retrieve(task.id)

    embeddings = [{
        'embedding': v.embedding.float,
        'start_offset_sec': v.start_offset_sec,
        'end_offset_sec': v.end_offset_sec,
        'embedding_scope': v.embedding_scope
    } for v in task_result.video_embeddings]
    
    return embeddings, task_result

This function creates an embedding task, monitors its progress, and retrieves the results once complete. It returns a list of embeddings along with their temporal information and scope.

Now, let's use this function to generate embeddings for a sample video:

# Example usage
video_url = "https://example.com/your-video.mp4"

# Generate embeddings for the video
embeddings, task_result = generate_embedding(video_url)

print(f"Generated {len(embeddings)} embeddings for the video")
for i, emb in enumerate(embeddings):
    print(f"Embedding {i+1}:")
    print(f"  Scope: {emb['embedding_scope']}")
    print(f"  Time range: {emb['start_offset_sec']} - {emb['end_offset_sec']} seconds")
    print(f"  Embedding vector (first 5 values): {emb['embedding'][:5]}")
    print()

This code snippet demonstrates how to generate embeddings for a video and inspect the results. Each embedding represents a segment of the video, capturing its multimodal characteristics.

Understanding the Embeddings:

  1. Embedding Vector: A high-dimensional representation of the video segment's content, capturing visual, audio, and contextual information.
  2. Time Range: The start and end times of the video segment represented by the embedding.
  3. Embedding Scope: Indicates the type of embedding ("clip" for a video segment, "video" for the full video); the sketch below shows how to split results by scope.
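
If you want to work with the segment-level vectors separately from the whole-video vector, you can split the results by scope. Here is a minimal sketch, assuming the "clip" and "video" scope values returned by the Embed API:

# Split the embeddings generated above by their scope
clip_embeddings = [e for e in embeddings if e['embedding_scope'] == 'clip']
full_video_embeddings = [e for e in embeddings if e['embedding_scope'] == 'video']

print(f"{len(clip_embeddings)} clip-level and {len(full_video_embeddings)} video-level embeddings")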


Storing Embeddings in LanceDB

After generating video embeddings using Twelve Labs' Embed API, the next crucial step is to store these embeddings efficiently in LanceDB. This allows for quick retrieval and enables various downstream tasks such as similarity search and content recommendation.

Let's walk through the process of storing our video embeddings in LanceDB:

import uuid

def insert_embeddings(embeddings: List[Dict[str, Any]], video_url: str) -> None:
    """Insert embeddings into the database."""
    data_to_insert = [{
        "video_id": str(uuid.uuid4()),
        "embedding": emb['embedding'],
        "start_time": emb['start_offset_sec'],
        "end_time": emb['end_offset_sec'],
        "video_url": video_url
    } for emb in embeddings]
    
    table.add(data_to_insert)
    print(f"Inserted {len(data_to_insert)} embeddings for video: {video_url}")

# Use the function to insert the embeddings we generated earlier
insert_embeddings(embeddings, video_url)

# Verify the insertion
print(f"Total embeddings in the table: {table.count_rows()}")

Let's break down what this code does:

  1. We create a function insert_embeddings that takes our generated embeddings and the video URL, formats them for insertion into LanceDB, and adds them to the table.
  2. We use this function to insert the embeddings we generated in the previous step.
  3. Finally, we verify the insertion by checking the total number of embeddings in the table.

This approach allows us to store not just the embedding vectors, but also crucial metadata like the temporal information and the source video URL. This additional information will be valuable for retrieval and contextualization in later steps of our workflow.
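
With generation and insertion in place, indexing a small collection of videos is just a loop over the two helpers. A minimal sketch follows; the URLs are placeholders for your own videos:

# Ingest several videos by reusing generate_embedding and insert_embeddings
video_urls = [
    "https://example.com/video-1.mp4",  # placeholder URLs
    "https://example.com/video-2.mp4",
]

for url in video_urls:
    embs, _ = generate_embedding(url)
    insert_embeddings(embs, url)

print(f"Total embeddings in the table: {table.count_rows()}")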


Querying and Retrieving Embeddings

Now that we have stored our video embeddings in LanceDB, we can leverage its powerful querying capabilities to perform similarity searches and retrieve relevant video segments. This is a crucial step in building applications like semantic video search, content recommendations, or retrieval-augmented generation systems.

Let's explore how to query and retrieve embeddings from LanceDB:

def similarity_search(query_embedding: List[float], k: int = 5) -> List[Dict[str, Any]]:
    """Perform a similarity search on the embeddings table."""
    return table.search(query_embedding).limit(k).to_list()

def get_text_embedding(text_query: str) -> List[float]:
    """Generate a text embedding for a given text query (using the same Marengo engine as the video task)."""
    res = twelvelabs_client.embed.create(engine_name="Marengo-retrieval-2.6", text=text_query)
    return res.text_embedding.float

# Example usage: Text-based search
text_query = "A person explaining machine learning concepts"
query_embedding = get_text_embedding(text_query)
search_results = similarity_search(query_embedding)

print(f"Top 5 results for query: '{text_query}'")
for i, result in enumerate(search_results):
    print(f"Result {i+1}:")
    print(f"  Video URL: {result['video_url']}")
    print(f"  Segment: {result['start_time']} - {result['end_time']} seconds")
    print(f"  Distance (lower is more similar): {result['_distance']}")
    print()

def get_video_segment(video_url: str, start_time: float, end_time: float) -> str:
    """Retrieve a video segment URL from a video URL using start and end times."""
    return f"{video_url}?start={start_time}&end={end_time}"

# Retrieve specific video segment for the top result
top_result = search_results[0]
video_segment = get_video_segment(top_result['video_url'], 
                                  top_result['start_time'], 
                                  top_result['end_time'])
print(f"Link to top result video segment: {video_segment}")

Let's break down the key components of this code:

  1. Similarity Search Function: We define a similarity_search function that takes a query embedding and returns the k most similar embeddings from our LanceDB table.
  2. Text-to-Embedding: The get_text_embedding function demonstrates how we convert a text query into an embedding using Twelve Labs' Embed API. This allows us to perform text-based searches on our video content.
  3. Querying Process:
    • We start with a text query.
    • Convert it to an embedding.
    • Perform a similarity search in LanceDB.
    • Retrieve and display the top results, including the video URL, segment timestamps, and similarity scores.
  4. Video Segment Retrieval: We include a placeholder function get_video_segment to demonstrate how you might generate a link or reference to a specific video segment based on the search results.

In a real-world application, you might want to add more sophisticated filtering options, such as:

# Example of filtering results by metadata
filtered_results = table.search(query_embedding).where("start_time > 10 AND end_time < 60").limit(5).to_list()

This would allow you to combine embedding-based similarity search with traditional metadata filtering for more precise results.
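
For instance, if you only want matches from a single source video, you can add an equality filter on the metadata column alongside the vector search. A minimal sketch, reusing the query_embedding and video_url variables from above:

# Combine vector search with a SQL-style metadata filter on the source video
results_for_video = (
    table.search(query_embedding)
    .where(f"video_url = '{video_url}'")
    .limit(5)
    .to_list()
)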


Integrating with RAG Workflow

Here's a simple example of how to integrate our video embedding system into a Retrieval Augmented Generation (RAG) workflow:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Constants
MODEL_NAME = "google/flan-t5-base"
MAX_QUERY_TOKENS = 77

# Initialize the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def generate_response(prompt: str, max_length: int = 100) -> str:
    """Generate a response using FLAN-T5 model."""
    inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(**inputs, max_length=max_length, num_return_sequences=1)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def retrieve_relevant_segments(query: str, k: int = 3) -> list:
    """Retrieve relevant video segments based on the query."""
    truncated_query = ' '.join(query.split()[:MAX_QUERY_TOKENS])
    query_embedding = get_text_embedding(truncated_query)
    return similarity_search(query_embedding, k)

def create_context(results: list) -> str:
    """Create context from retrieved video segments."""
    return "\n".join(
        f"Video segment from {result['start_time']} to {result['end_time']} seconds: {result['video_url']}"
        for result in results
    )

def rag_workflow(user_query: str) -> str:
    """Execute the RAG (Retrieval-Augmented Generation) workflow."""
    relevant_segments = retrieve_relevant_segments(user_query)
    context = create_context(relevant_segments)
    
    prompt = f"""Context:
{context}

User Query: {user_query}

Based on the context provided about the video segments, please answer the user's query:"""
    
    return generate_response(prompt)

# Example usage
user_query = "What is the purpose of the video?"
rag_response = rag_workflow(user_query)
print("User Query:", user_query)
print("\nRAG Response:", rag_response)

This code demonstrates a basic RAG workflow integrating our video embedding system:

  1. We start by setting up the FLAN-T5 model, an open-source LLM that's easily accessible for developers.
  2. The retrieve_relevant_segments function uses our previously defined similarity search to find video segments relevant to the user's query.
  3. create_context formats the retrieved video segment information into a context string.
  4. The rag_workflow function ties everything together:
    • It retrieves relevant video segments based on the user's query.
    • Creates a context from these segments.
    • Constructs a prompt combining the context and the user's query.
    • Uses the FLAN-T5 model to generate a response based on this enriched prompt.
  5. Finally, we demonstrate the workflow with an example query asking about the purpose of the video.

This RAG system allows the language model to provide more informed and contextually relevant responses by leveraging the video content we've indexed. It's particularly powerful for applications like writing a video summary, answering specific questions about video content, and providing insights or explanations based on video material (similar to what you can accomplish with the Twelve Labs Generate API).


Performance Considerations and Best Practices

When working with Twelve Labs' Embed API and LanceDB, we’d recommend the following best practices:

  • Consider batching your embedding requests for larger video datasets to optimize API usage.
  • Monitor your query performance and adjust the number of results (k) in similarity searches to balance between accuracy and speed.
  • When dealing with a large number of videos, consider implementing a caching strategy for frequently accessed embeddings to reduce API calls and database queries.

For production environments, here are other considerations:

  • Implement error handling and retries for API calls and database operations (see the retry sketch after this list).
  • Consider using asynchronous processing for embedding generation and database insertions to improve overall system responsiveness.
  • Regularly update your embeddings as Twelve Labs releases new versions of Marengo to benefit from improved accuracy and performance.
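
To illustrate the retry recommendation above, here is a minimal sketch that wraps the generate_embedding helper; the retry count and backoff values are arbitrary, and in production you would catch your SDK version's specific exception types rather than a bare Exception:

import time

def generate_embedding_with_retries(video_url: str, max_retries: int = 3, backoff_sec: float = 5.0):
    """Retry the embedding task a few times with a growing delay between attempts."""
    for attempt in range(1, max_retries + 1):
        try:
            return generate_embedding(video_url)
        except Exception as exc:  # narrow this to specific exception types in production
            if attempt == max_retries:
                raise
            wait = backoff_sec * attempt
            print(f"Attempt {attempt} failed ({exc}); retrying in {wait:.0f}s")
            time.sleep(wait)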


Conclusion & Next Steps

This tutorial has demonstrated how to integrate Twelve Labs' Embed API with LanceDB to create a powerful video understanding and retrieval system. We've covered embedding generation, storage, querying, and practical applications like semantic search and RAG workflows. This integration opens up numerous possibilities for building sophisticated video-based AI applications.

To further enhance your project, explore the resources listed in the appendix below.


Appendix

For your reference and further exploration:

  1. Complete Colab Notebook
  2. Twelve Labs API documentation
  3. LanceDB documentation

We'd love to see what you build! Share your projects and experiences with the Twelve Labs and LanceDB communities. Happy coding!


