Building a Semantic Video Search Workflow with TwelveLabs and Qdrant

James Le

Feb 24, 2025

8 Min

We demonstrate how to build a semantic video search workflow by combining TwelveLabs’ multimodal embedding capabilities with Qdrant’s vector search engine.

Big thanks to the Qdrant team (David Myriel and Anush Shetty) for collaborating with us on this tutorial.


Introduction

In today’s data-driven world, video content is a rich source of information that combines multiple modalities, including visuals, audio, and text. However, extracting meaningful insights from videos and enabling semantic search across them can be challenging due to their complexity. This is where the integration of TwelveLabs Embed API and Qdrant comes into play.

The TwelveLabs Embed API empowers developers to create multimodal embeddings that capture the essence of video content, including visual expressions, body language, spoken words, and contextual cues. These embeddings live in a unified vector space, enabling seamless cross-modal understanding. Qdrant, in turn, is a powerful vector similarity search engine that lets you store and query these embeddings efficiently.

In this tutorial, we’ll demonstrate how to build a semantic video search workflow by combining TwelveLabs’ multimodal embedding capabilities with Qdrant’s vector search engine. By the end of this guide, you’ll be able to:

  • Generate multimodal embeddings for videos using the TwelveLabs Embed API.

  • Store and manage these embeddings in Qdrant.

  • Perform semantic searches across video content using text or other modalities.

This workflow is ideal for applications like video indexing, content recommendation systems, and contextual search engines.


1 - Setting Up The Environment

Before diving into the implementation, let’s set up the necessary tools and libraries. For this tutorial, we’ll use Python in a Colab notebook environment.


Step 1: Install Required SDKs

Run the following command in your Colab notebook to install the TwelveLabs and Qdrant SDKs:

!pip install twelvelabs qdrant-client


Step 2: Configure API Clients

Next, configure the TwelveLabs and Qdrant clients. The TwelveLabs client is initialized with your API key, while the Qdrant client runs in-memory for this tutorial.

from twelvelabs import TwelveLabs
from qdrant_client import QdrantClient

# Get your API keys from: https://playground.twelvelabs.io/dashboard/api-key
from google.colab import userdata
TL_API_KEY = userdata.get('TL_API_KEY')

twelvelabs_client = TwelveLabs(api_key=TL_API_KEY)
qdrant_client = QdrantClient(":memory:")
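
The ":memory:" mode runs Qdrant entirely inside the notebook process, which is convenient for experimentation but is wiped when the runtime restarts. If you have a self-hosted Qdrant server or a Qdrant Cloud cluster, you can point the client at it instead. The snippet below is a minimal sketch with placeholder values, not part of the original setup:

# Optional alternative: connect to a running Qdrant deployment instead of ":memory:".
# The URL and secret name below are placeholders for your own deployment.
# qdrant_client = QdrantClient(
#     url="https://YOUR-CLUSTER-URL.qdrant.io",
#     api_key=userdata.get('QDRANT_API_KEY'),
# )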


2 - Generating Multimodal Embeddings with TwelveLabs

The TwelveLabs Embed API allows you to generate multimodal embeddings that capture the essence of video content across modalities like visuals, audio, and text. These embeddings are represented as high-dimensional vectors, enabling seamless semantic search and cross-modal understanding. In this section, we’ll demonstrate how to use the Marengo-retrieval-2.7 engine to create embeddings for a video.


Step 1: Understanding the Embedding Process

The Marengo-retrieval-2.7 engine is optimized for video-native embeddings with a dimensionality of 1024. It supports cosine similarity for vector comparisons, making it suitable for tasks like semantic search or retrieval. You can also use this engine to embed audio, text, and images into the same vector space, enabling cross-modality searches. For context, the engine is built on top of TwelveLabs’ state-of-the-art video embedding model Marengo 2.7.


Step 2: Embedding a Video

To generate embeddings for a video, follow these steps:

  1. Specify the Video URL: Provide the URL of the video you want to process.

  2. Create an Embedding Task: Use the TwelveLabs client to initiate an embedding task.

  3. Wait for Task Completion: Monitor the task status until it’s complete.

  4. Retrieve the Embeddings: Once completed, retrieve the vector embeddings from the task results.

Here’s the implementation in Python:

# Step 1: Create an embedding task
task = twelvelabs_client.embed.task.create(
    model_name="Marengo-retrieval-2.7",  # Specify the model
    video_url="https://sample-videos.com/video321/mp4/720/big_buck_bunny_720p_2mb.mp4"  # Video URL
)

# Step 2: Wait for the task to complete
task.wait_for_done(sleep_interval=3)  # Check every 3 seconds

# Step 3: Retrieve the embeddings
task_result = twelvelabs_client.embed.task.retrieve(task.id)

# Display the embedding results for the first video segment
first_segment = task_result.video_embedding.segments[0]
print("Embedding Vector (First 10 Dimensions):", first_segment.embeddings_float[:10])
print("Embedding Dimensionality:", len(first_segment.embeddings_float))


Step 3: Cross-Modality Embeddings (Optional)

The same engine can embed other modalities (e.g., text, audio, or images) into a unified vector space. For example:

  • Text: Input descriptive text such as "a person riding a bike".

  • Audio: Use an audio file URL to extract its semantic representation.

  • Image: Provide an image URL for embedding.

This flexibility allows you to perform cross-modality searches, such as querying a video collection using text or audio descriptions.
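
As a quick illustration of this shared space, the sketch below reuses the same embed.create call that appears later in this tutorial to embed one of the example text descriptions and confirm that its vector has the same 1024 dimensions as the video embeddings:

# Sketch: embed a text description into the same vector space as the video
text_embedding_result = twelvelabs_client.embed.create(
    model_name="Marengo-retrieval-2.7",
    text="a person riding a bike",  # Example text from the list above
)
text_vector = text_embedding_result.text_embedding.segments[0].embeddings_float
print("Text embedding dimensionality:", len(text_vector))  # Expected: 1024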


Step 4: Next Steps

Once you’ve generated embeddings, they can be stored in a vector database like Qdrant for efficient similarity search. In the next section, we’ll explore how to prepare these embeddings for insertion into Qdrant and perform semantic searches across your video content.


3 - Preparing Data for Qdrant

Once you’ve generated multimodal embeddings using the TwelveLabs Embed API, the next step is to prepare these embeddings for insertion into Qdrant. Qdrant operates on points, which consist of a vector, a unique ID, and an optional payload with additional metadata. For this tutorial, we’ll map each video segment’s embedding into Qdrant’s PointStruct format.


Step 1: Extract Embedding Segments

The TwelveLabs Embed API generates embeddings for video segments, each containing:

  • Vector: The high-dimensional embedding.

  • Metadata: Including start and end timestamps (start_offset_sec and end_offset_sec) and the embedding scope.
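
Before converting anything, it is worth sanity-checking what the API returned. The short sketch below simply iterates over the segments produced by the embedding task above and prints their metadata and vector size:

# Inspect the segments returned by the embedding task
for segment in task_result.video_embedding.segments:
    print(
        f"scope={segment.embedding_scope}, "
        f"start={segment.start_offset_sec}s, end={segment.end_offset_sec}s, "
        f"dimensions={len(segment.embeddings_float)}"
    )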


Step 2: Convert to Qdrant Points

We’ll loop through the video embedding segments and convert them into Qdrant-compatible points. The metadata will be stored in the payload for each point.

Here’s the Python code to achieve this:

from qdrant_client.models import PointStruct

# Convert embedding segments to Qdrant points
points = [
    PointStruct(
        id=idx,  # Unique identifier for each vector
        vector=v.embeddings_float,  # Embedding vector
        payload={
            "start_offset_sec": v.start_offset_sec,  # Start time of the segment
            "end_offset_sec": v.end_offset_sec,      # End time of the segment
            "embedding_scope": v.embedding_scope,   # Scope of the embedding
        },
    )
    for idx, v in enumerate(task_result.video_embedding.segments)
]

print(f"Prepared {len(points)} points for insertion into Qdrant.")

At this stage, your data is ready to be inserted into a Qdrant collection.


4 - Setting Up a Qdrant Collection

Qdrant organizes vectors into collections, which are named sets of points. Each collection has specific parameters such as vector dimensionality and a distance metric (e.g., cosine similarity). Let’s create a collection to store the prepared points.


Step 1: Define Collection Parameters

For this tutorial:

  • Vector Size: 1024 (matches the dimensionality of embeddings from TwelveLabs).

  • Distance Metric: Cosine similarity (optimal for comparing normalized vectors).


Step 2: Create a Collection

Use the following code to create a collection in Qdrant:

from qdrant_client.models import VectorParams, Distance

# Define collection name
collection_name = "twelve_labs_collection"

# Create a collection with specified parameters
qdrant_client.create_collection(
    collection_name,
    vectors_config=VectorParams(
        size=1024,  # Dimensionality of vectors
        distance=Distance.COSINE,  # Similarity metric
    ),
)

print(f"Collection '{collection_name}' created successfully.")


Step 3: Insert Points into the Collection

Now that the collection is set up, insert the prepared points:

# Insert points into the collection
qdrant_client.upsert(collection_name, points)

print(f"Inserted {len(points)} points into '{collection_name}'.")


Summary

At this point:

  1. The embeddings have been converted into Qdrant-compatible points.

  2. A new collection has been created with appropriate parameters.

  3. The points have been successfully inserted into the collection.

In the next section, we’ll demonstrate how to query this collection to perform semantic searches across your video content.


5 - Performing Semantic Searches

With the embeddings stored in Qdrant, you can now perform semantic searches across different modalities, such as text, audio, and images. This section demonstrates how to query the Qdrant collection using embeddings generated by the TwelveLabs Embed API.


Step 1: Querying with Text

Text-based queries allow you to search for video segments that semantically match a given textual description. For example, let’s search for segments related to "A white rabbit."

# Generate text embedding
text_segment = twelvelabs_client.embed.create(
    model_name="Marengo-retrieval-2.7",
    text="A white rabbit",  # Input query
).text_embedding.segments[0]

# Perform semantic search in Qdrant
text_results = qdrant_client.query_points(
    collection_name=collection_name,
    query=text_segment.embeddings_float,  # Use the embedding vector
)

print("Text Query Results:", text_results)


Step 2: Querying with Audio

Audio-based queries allow you to search for video segments that match the semantic content of an audio clip. For instance, here’s how to use an audio file as a query:

# Generate audio embedding
audio_segment = twelvelabs_client.embed.create(
    model_name="Marengo-retrieval-2.7",
    audio_url="https://codeskulptor-demos.commondatastorage.googleapis.com/descent/background%20music.mp3",  # Audio file URL
).audio_embedding.segments[0]

# Perform semantic search in Qdrant
audio_results = qdrant_client.query_points(
    collection_name=collection_name,
    query=audio_segment.embeddings_float,  # Use the embedding vector
)

print("Audio Query Results:", audio_results)


Step 3: Querying with an Image

Image-based queries enable you to find video segments that are semantically similar to a given image. For example:

# Generate image embedding
image_segment = twelvelabs_client.embed.create(
    model_name="Marengo-retrieval-2.7",
    image_url="https://gratisography.com/wp-content/uploads/2024/01/gratisography-cyber-kitty-1170x780.jpg",  # Image URL
).image_embedding.segments[0]

# Perform semantic search in Qdrant
image_results = qdrant_client.query_points(
    collection_name=collection_name,
    query=image_segment.embeddings_float,  # Use the embedding vector
)

print("Image Query Results:", image_results)


Summary

By leveraging the TwelveLabs Embed API and Qdrant:

  • You can perform cross-modal searches using text, audio, or images as queries.

  • The unified vector space ensures that embeddings from different modalities are comparable, enabling seamless multimodal retrieval.
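
Because each point carries the segment timestamps in its payload, a few lines of post-processing turn the raw responses into human-readable matches. The helper below is a minimal sketch that assumes the query_points results from the examples above:

def print_matches(results, label, top_k=3):
    """Print the best-matching video segments with their timestamps and scores."""
    for point in results.points[:top_k]:
        payload = point.payload
        print(
            f"[{label}] score={point.score:.3f} | "
            f"{payload['start_offset_sec']}s - {payload['end_offset_sec']}s "
            f"({payload['embedding_scope']})"
        )

print_matches(text_results, "text")
print_matches(audio_results, "audio")
print_matches(image_results, "image")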


6 - Conclusion and Next Steps

In this tutorial, we explored how to build a semantic video search workflow by integrating the TwelveLabs Embed API with Qdrant. By generating multimodal embeddings from video content and leveraging Qdrant’s vector database, we created a powerful system capable of performing semantic searches across modalities such as text, audio, and images. This workflow demonstrates the potential of combining advanced AI models with scalable vector search technology to unlock new possibilities in video understanding and retrieval.


Key Takeaways

  • The TwelveLabs Embed API provides state-of-the-art multimodal embeddings that capture the essence of video content across visual, audio, and textual modalities.

  • Qdrant enables efficient storage and similarity search of these embeddings using its flexible collection structure and high-performance query capabilities.

  • The unified vector space created by TwelveLabs’ models allows for seamless cross-modal searches, making it possible to query videos using text descriptions, audio clips, or images.


Call to Action

As we move forward, we encourage developers and businesses to explore the combined power of TwelveLabs and Qdrant for building next-generation AI applications. Whether it’s semantic video search, personalized recommendations, or innovative RAG workflows, this partnership is poised to redefine how we interact with multimodal data.
