Partnerships

Multivector Video Retrieval with TwelveLabs and Vespa

James Le

Integrating TwelveLabs' Embed API with Vespa unlocks the potential for advanced semantic video search applications. The Embed API provides rich multimodal embeddings that capture the temporal and contextual nuances of video content, while Vespa's robust indexing and hybrid ranking capabilities enable efficient retrieval of relevant segments.


Mar 3, 2025

9 Min


Introduction

In the era of multimodal content, extracting meaningful insights from video data requires sophisticated tools that can process and interpret multiple modalities such as text, audio, and visuals. TwelveLabs' Embed API empowers developers to generate rich, multimodal embeddings that encapsulate the essence of video content, including visual expressions, spoken words, and contextual interactions. These embeddings enable advanced applications like semantic video search by providing a unified vector representation of videos.

On the other hand, Vespa, a platform designed for low-latency computation over large datasets, excels in indexing and querying structured and vector data. With its support for approximate nearest neighbor (ANN) search and hybrid ranking capabilities, Vespa is an ideal partner for deploying scalable video search solutions.

This tutorial demonstrates how to integrate TwelveLabs' Embed API with Vespa to build a semantic video search application. By combining the strengths of both platforms, we can efficiently index video embeddings and metadata while enabling powerful hybrid search capabilities.


Step 1: Setup and Configuration

In this section, we'll set up the necessary environment and configurations to build a semantic video search application using TwelveLabs' Embed API and Vespa Cloud. Let's go through the setup process step by step.


Prerequisites

Before we begin, ensure you have:

  • A TwelveLabs account and API key (the Free plan is sufficient for this tutorial)

  • A Vespa Cloud trial account with a tenant created

  • A Python 3 environment (for example, a Jupyter notebook) where you can install packages


Environment Setup

First, let's install the required Python packages:

!pip3 install pyvespa vespacli twelvelabs pandas

Now, let's import all the necessary packages:

import os
import hashlib
import json

from vespa.package import (
    ApplicationPackage,
    Field,
    Schema,
    Document,
    HNSW,
    RankProfile,
    FieldSet,
    SecondPhaseRanking,
    Function,
)

from vespa.deployment import VespaCloud
from vespa.io import VespaResponse, VespaQueryResponse

from twelvelabs import TwelveLabs
from twelvelabs.models.embed import EmbeddingsTask

import pandas as pd

from datetime import datetime


API Configuration

To use the TwelveLabs Embed API, you'll need to set up your API key:

  1. Sign up at https://auth.twelvelabs.io/u/signup if you haven't already

  2. Navigate to https://playground.twelvelabs.io/dashboard/api-key to get your API key

  3. Set up your API key:

TL_API_KEY = os.getenv("TL_API_KEY") or input("Enter your TL_API key: ")

Note: The Free plan includes 600 minutes of video indexing, which is sufficient for this tutorial.


Vespa Cloud Setup

To set up Vespa Cloud:

  1. Create a Vespa Cloud Trial account at https://vespa.ai/free-trial

  2. Log in to console.vespa-cloud.com and create a tenant

  3. Configure your application settings:

# Replace with your tenant name from the Vespa Cloud Console
tenant_name = "vespa-team"
# Replace with your application name (does not need to exist yet)
application = "videosearch"


Verification

Let's verify our setup by initializing the TwelveLabs client:

# Initialize Twelve Labs client
client = TwelveLabs(api_key=TL_API_KEY)

# Test the connection
try:
    client.task.list()
    print("Successfully connected to Twelve Labs API")
except Exception as e:
    print(f"Error connecting to Twelve Labs API: {e}")

With these configurations in place, we're ready to move on to generating embeddings for our sample videos in the next section.

Important: Keep your API keys secure and never commit them directly in your code. Consider using environment variables or secure secret management solutions in production environments.


Step 2: Generate Attributes and Embeddings

In this section, we'll generate multimodal embeddings and attributes for our sample videos using the Twelve Labs Generate API and Embed API. We'll process three videos from the Internet Archive to demonstrate the workflow.


Initialize Video Processing

First, let's set up our video sources and create an index:

VIDEO_URLs = [
    "https://ia801503.us.archive.org/27/items/hide-and-seek-with-giant-jenny/HnVideoEditor_2022_10_29_205557707.ia.mp4",
    "https://ia601401.us.archive.org/1/items/twas-the-night-before-christmas-1974-full-movie-freedownloadvideo.net/twas-the-night-before-christmas-1974-full-movie-freedownloadvideo.net.mp4",
    "https://dn720401.ca.archive.org/0/items/mr-bean-the-animated-series-holiday-for-teddy/S2E12.ia.mp4",
]

# Initialize client and create index
client = TwelveLabs(api_key=TL_API_KEY)
timestamp = int(datetime.now().timestamp())
index_name = "Vespa_" + str(timestamp)

# Create Index with Pegasus 1.2 model
index = client.index.create(
    name=index_name,
    models=[{"name": "pegasus1.2", "options": ["visual", "audio"]}],
    addons=["thumbnail"]
)


Upload and Process Videos

Now let's upload the videos and process them:

def on_task_update(task: EmbeddingsTask):
    print(f"  Status={task.status}")

for video_url in VIDEO_URLs:
    task = client.task.create(index_id=index.id, url=video_url, language="en")
    status = task.wait_for_done(sleep_interval=10, callback=on_task_update)
    if task.status != "ready":
        raise RuntimeError(f"Indexing failed with status {task.status}")


Generate Video Attributes

Generate summaries and keywords for each video:

summaries = []
keywords_array = []
titles = [
    "Mr. Bean the Animated Series Holiday for Teddy",
    "Twas the night before Christmas",
    "Hide and Seek with Giant Jenny",
]

videos = client.index.video.list(index.id)
for video in videos:
    # Generate summary
    res = client.generate.summarize(
        video_id=video.id,
        type="summary",
        prompt="Generate an abstract of the video serving as metadata on the video, up to five sentences."
    )
    summaries.append(res.summary)
    
    # Generate keywords
    keywords = client.generate.text(
        video_id=video.id,
        prompt="Based on this video, I want to generate five keywords for SEO. Provide just the keywords as a comma delimited list."
    )
    keywords_array.append(keywords.data)
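
To spot-check the generated attributes (a small sketch assuming the titles, summaries, and keywords_array lists built above, which all follow the listing order of the index), print them side by side:

# Print each title with its generated keywords and the start of its summary
for title, summary, keywords in zip(titles, summaries, keywords_array):
    print(title)
    print(f"  Keywords: {keywords}")
    print(f"  Summary:  {summary[:120]}...")
    print()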


Generate Multimodal Embeddings

Create embedding tasks using the Marengo retrieval model:

task_ids = []

for url in VIDEO_URLs:
    task = client.embed.task.create(model_name="Marengo-retrieval-2.7", video_url=url)
    task_ids.append(str(task.id))
    status = task.wait_for_done(sleep_interval=10, callback=on_task_update)
    if task.status != "ready":
        raise RuntimeError(f"Embedding failed with status {task.status}")


Retrieve and Process Embeddings

Finally, we'll retrieve the generated embeddings:

tasks = []
for task_id in task_ids:
    task = client.embed.task.retrieve(task_id)
    tasks.append(task)

The embeddings output has the following characteristics:

  • Each video is segmented into 6-second chunks

  • Each segment contains a 1024-dimensional embedding vector

  • Videos generate between 37 and 242 segments, depending on length

  • Each segment includes start and end offset timestamps

  • The embedding scope is set to "clip" for temporal context

These embeddings capture the multimodal aspects of the videos, including visual elements, audio, and temporal relationships, which we'll use to enable semantic search in Vespa.
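
As a quick sanity check (a minimal sketch assuming the tasks list retrieved above), you can print the segment count, scope, and embedding dimension for each task:

# Inspect each embedding task: segment count, scope, and vector dimension
for task in tasks:
    segments = task.video_embedding.segments
    first = segments[0]
    print(f"Task {task.id}: {len(segments)} segments, scope={first.embedding_scope}")
    print(f"  First segment: {first.start_offset_sec}-{first.end_offset_sec}s, dimension={len(first.embeddings_float)}")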


Step 3: Deploy Vespa Application

In this step, we will deploy the Vespa application package to Vespa Cloud and feed the processed video data, including attributes and embeddings, into the application.


Create an Application Package

The Vespa application package defines the schema and configuration for the application. It includes the fields generated by the Twelve Labs Embed API (e.g., video attributes and embeddings). Here's how to define the schema using pyvespa:

# Define the schema for "videos"
videos_schema = Schema(
    name="videos",
    document=Document(
        fields=[
            Field(name="video_url", type="string", indexing=["summary"]),
            Field(name="title", type="string", indexing=["index", "summary"], match=["text"], index="enable-bm25"),
            Field(name="keywords", type="string", indexing=["index", "summary"], match=["text"], index="enable-bm25"),
            Field(name="video_summary", type="string", indexing=["index", "summary"], match=["text"], index="enable-bm25"),
            Field(name="embedding_scope", type="string", indexing=["attribute", "summary"]),
            Field(name="start_offset_sec", type="array<float>", indexing=["attribute", "summary"]),
            Field(name="end_offset_sec", type="array<float>", indexing=["attribute", "summary"]),
            Field(
                name="embeddings",
                type="tensor<float>(p{},x[1024])",
                indexing=["index", "attribute"],
                ann=HNSW(distance_metric="angular"),
            ),
        ]
    ),
)

# Add a default fieldset so userQuery() matches the text fields
videos_schema.add_field_set(
    FieldSet(name="default", fields=["title", "keywords", "video_summary"])
)

# Define ranking functions
mapfunctions = [
    Function(
        name="similarities",
        expression="""sum(query(q) * attribute(embeddings), x)""",
    ),
    Function(
        name="bm25_score",
        expression="bm25(title) + bm25(keywords) + bm25(video_summary)",
    ),
]

# Define a hybrid rank profile
semantic_rankprofile = RankProfile(
    name="hybrid",
    inputs=[("query(q)", "tensor<float>(x[1024])")],
    first_phase="bm25_score",
    second_phase=SecondPhaseRanking(expression="closeness(field, embeddings)", rerank_count=10),
    match_features=["closest(embeddings)"],
    summary_features=["similarities"],
    functions=mapfunctions,
)

# Add rank profile to schema
videos_schema.add_rank_profile(semantic_rankprofile)

# Create the application package
package = ApplicationPackage(name=application, schema=[videos_schema])

This schema defines how video data (attributes and embeddings) will be stored, indexed, and queried in Vespa.


Deploy the Application Package

Deploying the application package to Vespa Cloud involves creating a connection using VespaCloud and deploying the defined package.

from vespa.deployment import VespaCloud

# Deploy the application package to Vespa Cloud
vespa_cloud = VespaCloud(
    tenant=tenant_name,
    application=application,
    application_package=package,
    key_content=os.getenv("VESPA_TEAM_API_KEY"),  # Replace with your API key content
)

app = vespa_cloud.deploy()
print("Deployment complete!")

Once deployed, you will see deployment logs in the console output. The application is now live and ready to receive data.


Feed Data into Vespa

We now feed the video data (attributes and embeddings) into Vespa using pyvespa. Each document corresponds to one video, identified by a unique ID derived from its URL; the per-segment offsets and embeddings are stored as arrays and as a mapped tensor keyed by segment index.

Prepare Data for Feeding

import hashlib

# Initialize a list to store Vespa feed documents
vespa_feed = []

# Reverse VIDEO_URLs since attributes were generated in reverse order
VIDEO_URLs.reverse()

for i, task in enumerate(tasks):
    video_url = VIDEO_URLs[i]
    title = titles[i]
    keywords = keywords_array[i]
    summary = summaries[i]

    start_offsets = []
    end_offsets = []
    embeddings = {}

    # Collect per-segment offsets and embeddings (mapped tensor cells keyed by segment index)
    for index, segment in enumerate(task.video_embedding.segments):
        start_offsets.append(float(segment.start_offset_sec))
        end_offsets.append(float(segment.end_offset_sec))
        embeddings[str(index)] = list(map(float, segment.embeddings_float))

    # Create a unique document ID for the video
    id_hash = hashlib.md5(video_url.encode()).hexdigest()

    document = {
        "id": id_hash,
        "fields": {
            "video_url": video_url,
            "title": title,
            "keywords": keywords,
            "video_summary": summary,
            "embedding_scope": segment.embedding_scope,  # all segments use the "clip" scope
            "start_offset_sec": start_offsets,
            "end_offset_sec": end_offsets,
            "embeddings": embeddings,
        },
    }
    vespa_feed.append(document)

Feed Data into Vespa

from vespa.io import VespaResponse

def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(f"Failed to feed document {id} with status code {response.status_code}: {response.get_json()}")

# Feed data synchronously
app.feed_iterable(vespa_feed, schema="videos", callback=callback)
print("Data feeding complete!")

The feed operation ensures that all video data is indexed in Vespa's content clusters. You can validate this by checking logs or querying your Vespa instance.
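
For example, a quick way to confirm the feed (a minimal sketch assuming the app handle from the deployment step) is to run a match-all query and read the total document count:

# Count indexed documents with a match-all query (hits=0 returns only the count)
with app.syncio(connections=1) as session:
    resp = session.query(yql="select * from videos where true", hits=0)
    print("Documents indexed:", resp.get_json()["root"]["fields"]["totalCount"])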

Summary

  • Application Package: Defined a schema for storing video attributes and embeddings.

  • Deployment: Deployed the package to Vespa Cloud.

  • Data Feeding: Indexed processed video data into the deployed application.

Your Vespa application is now ready for semantic search operations!


Step 4: Perform Semantic Search

In this step, we will perform a semantic search on the video embeddings stored in Vespa. The search will use a hybrid ranking approach, combining lexical and vector-based similarity to retrieve relevant video segments.

Performing a Hybrid Search

Generate Query Embedding

To search for video segments that match the query "Santa Claus on his sleigh," we first generate an embedding for the query using the same model (Marengo-retrieval-2.7) used for video embeddings:

client = TwelveLabs(api_key=TL_API_KEY)
user_query = "Santa Claus on his sleigh"

# Generate embedding for the query
res = client.embed.create(
    model_name="Marengo-retrieval-2.7",
    text=user_query,
)

print("Created a text embedding")
print(f" Model: {res.model_name}")
if res.text_embedding is not None and res.text_embedding.segments is not None:
    q_embedding = res.text_embedding.segments[0].embeddings_float
    print(f" Embedding Dimension: {len(q_embedding)}")
    print(f" Sample 5 values from array: {q_embedding[:5]}")

This step outputs a 1024-dimensional query embedding, which will be used to perform a nearest-neighbor search in Vespa.


Execute Hybrid Search in Vespa

Using Vespa's approximate nearest neighbor (ANN) search capabilities, we combine lexical search (BM25) with vector similarity ranking. The query retrieves the top hit based on hybrid ranking:

with app.syncio(connections=1) as session:
    response: VespaQueryResponse = session.query(
        yql="select * from videos where userQuery() OR ({targetHits:100}nearestNeighbor(embeddings,q))",
        query=user_query,
        ranking="hybrid",
        hits=1,
        body={"input.query(q)": q_embedding},
    )
    assert response.is_successful()

# Print the top hit
for hit in response.hits:
    print(json.dumps(hit, indent=4))

# Get full response JSON
response.get_json()

The hybrid ranking works as follows:

  1. First Phase: BM25 lexical ranking based on video titles, keywords, and summaries.

  2. Second Phase: Reranking using vector similarity between the query embedding and video embeddings.

The top hit corresponds to the segment with the highest combined score, as shown in the output logs.


Process Results into a DataFrame

To make results more consumable, we extract and sort the top N segments based on their similarity scores using a pandas DataFrame:

def get_top_n_similarity_matches(data, N=5):
    """
    Extracts top N similarity scores and their corresponding offsets.

    Args:
    - data (dict): Input JSON-like structure containing similarities and offsets.
    - N (int): Number of top similarity scores to return.

    Returns:
    - pd.DataFrame: A DataFrame with top N similarity scores and offsets.
    """
    # Extract relevant fields
    similarities = data["fields"]["summaryfeatures"]["similarities"]["cells"]
    start_offset_sec = data["fields"]["start_offset_sec"]
    end_offset_sec = data["fields"]["end_offset_sec"]

    # Sort by similarity score
    sorted_similarities = sorted(similarities.items(), key=lambda x: x[1], reverse=True)

    # Prepare results for top N matches
    results = []
    for index_str, score in sorted_similarities[:N]:
        index = int(index_str)
        if index < len(start_offset_sec):
            results.append({
                "index": index,
                "similarity_score": score,
                "start_offset_sec": start_offset_sec[index],
                "end_offset_sec": end_offset_sec[index],
            })

    return pd.DataFrame(results)

# Get top 10 matches
df_result = get_top_n_similarity_matches(response.hits[0], N=10)
print(df_result)

The resulting DataFrame includes:

  • Index: Segment index within the video.

  • Similarity Score: Relevance score based on vector similarity.

  • Start/End Offsets: Temporal boundaries of the segment in seconds.

The printed DataFrame shows the top 10 matches for the query. This format makes it easy to identify the most relevant video segments for further analysis or playback.

Summary

  • We generated a query embedding using TwelveLabs' Embed API.

  • A hybrid search was performed in Vespa, combining lexical BM25 ranking with vector similarity.

  • The results were processed into a pandas DataFrame for easy analysis of top matches.

This workflow demonstrates how to efficiently retrieve semantically relevant video segments using TwelveLabs' multimodal embeddings and Vespa's advanced search capabilities.


Step 5: Review the Results

In this step, we will review the results of the semantic search by identifying and consolidating the relevant video segments. We will then use a video player embedded in the notebook to visually inspect these segments and evaluate their relevance.


Consolidate Contiguous Segments

To make it easier to review the results, we group contiguous video segments (based on their start and end offsets) into consolidated ranges, add a 3-second overlap at the boundaries of each segment for smoother transitions, and convert the time offsets into MM:SS format for readability.
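
A minimal sketch of such a helper is shown below, assuming the df_result DataFrame from Step 4; the exact implementation in your notebook may differ:

def consolidate_segments(df, overlap_sec=3):
    """Merge contiguous or overlapping segments and format their offsets as MM:SS."""

    def mmss(seconds):
        seconds = max(int(seconds), 0)
        return f"{seconds // 60:02d}:{seconds % 60:02d}"

    # Widen each segment by the overlap, then merge ranges that touch or overlap
    intervals = sorted(
        (row.start_offset_sec - overlap_sec, row.end_offset_sec + overlap_sec)
        for row in df.itertuples()
    )
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(mmss(s), mmss(e)) for s, e in merged]

segments = consolidate_segments(df_result)
print("Consolidated Segments (MM:SS):", segments)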

For example, if the top matches include overlapping or adjacent time ranges, this function will merge them into a single range. The output might look like this:

Consolidated Segments (MM:SS): [('20:15', '20:27'), ('20:39', '21:21'), ('22:51', '23:15')]


Visualize Results Using a Video Player

To inspect the identified segments, we embed a video player in the notebook. You can manually navigate to the consolidated time ranges or programmatically set the playback time to jump directly to specific segments.


Embed a Video Player

from IPython.display import HTML

# Define video URL (replace with your video URL)
video_url = "https://ia601401.us.archive.org/1/items/twas-the-night-before-christmas-1974-full-movie-freedownloadvideo.net/twas-the-night-before-christmas-1974-full-movie-freedownloadvideo.net.mp4"

# Define middle point of a segment for playback preview (e.g., first consolidated segment)
middle_point = (1272 + 1278) / 2  # Example from top match

# Generate HTML for the video player; the #t media fragment starts playback at middle_point
video_player = f"""
<video id="myVideo" width="640" height="480" controls>
  <source src="{video_url}#t={middle_point}" type="video/mp4">
  Your browser does not support the video tag.
</video>
"""

# Display the video player
HTML(video_player)

This code generates an interactive video player that starts playback at a specified time (middle_point). You can update middle_point dynamically based on any segment of interest.
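
To preview a different match, recompute middle_point from any row of the df_result DataFrame produced in Step 4 (a small sketch; re-run the player cell afterwards):

# Recompute the preview point from the best-scoring row of df_result
top = df_result.iloc[0]
middle_point = (top["start_offset_sec"] + top["end_offset_sec"]) / 2
print(f"Previewing from {middle_point:.0f} seconds")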


Evaluate Results

Once the player is embedded, you can visually inspect each consolidated segment by navigating through its time ranges. Use the segments list generated earlier to locate specific moments in the video that match your query.

For example:

  • Segment: '20:21' - '21:18'

  • Observation: This segment shows Santa Claus on his sleigh, confirming that our hybrid search successfully identified relevant content.


Conclusion

Integrating TwelveLabs' Embed API with Vespa unlocks the potential for advanced semantic video search applications. The Embed API provides rich multimodal embeddings that capture the temporal and contextual nuances of video content, while Vespa's robust indexing and hybrid ranking capabilities enable efficient retrieval of relevant segments.

Key benefits of this integration include:

  • Multimodal Understanding: Unified embeddings from TwelveLabs ensure a comprehensive representation of video content across modalities.

  • Scalability: Vespa handles large datasets with ease, supporting low-latency queries even at scale.

  • Hybrid Search: Combining lexical (BM25) and semantic (ANN-based) search ensures precise retrieval of relevant results.

  • Flexibility: Developers can customize schemas, rank profiles, and query logic to fit specific use cases.

By following this tutorial, developers can build scalable and intelligent video search systems that enhance user experiences and unlock new possibilities in content discovery.
