Partnerships


Multimodal RAG: Chat with Videos Using TwelveLabs and Chroma

James Le


This guide demonstrated how to interact with video content, either individually or as a complete set, using the TwelveLabs Embed API and Chroma’s vector database (running locally on your machine) to manage retrieval.



Mar 21, 2025


20 Min



Big thanks to Jeff Huber and Itai Smith from the Chroma team for reviewing the draft!


Welcome to this tutorial on integrating TwelveLabs' Embed API with Chroma’s Vector Database for RAG-based Q&A on videos. In this guide, you'll learn how to extract text answers from an unstructured database of videos using generative models.

We combine TwelveLabs' rich, contextual embeddings with Chroma’s vector database to store, index, and query these video embeddings, creating a chat application. This notebook demonstrates the current possibilities of these technologies with just a few lines of code.

For comparison, we also showcase the difference in developer experience between using TwelveLabs' Generate API to generate text responses and a leading open-source model, LLaVA-NeXT-Video.


1 - Overview

In this tutorial we will go through several examples where we use TwelveLabs Marengo to create video embeddings, use Chroma to store and query those embeddings to find relevant videos, and use TwelveLabs Pegasus and LLaVA-NeXT-Video to chat with the returned videos.

We will:

  1. Create Video Embeddings Using the TwelveLabs Marengo Engine

  2. Store Video Embeddings in a Chroma Database

  3. Query Embeddings in our Chroma Database to Find Relevant Video Segments

  4. Use TwelveLabs Pegasus to Chat with the Returned Video Segment

  5. Use an Open Source Model to Chat with the Returned Video Segment

  6. Compare Pegasus to the Open Source model

  7. Use Chroma and TwelveLabs Embeddings to Search Multiple Videos

  8. Use Pegasus to Chat with a Full Video

  9. Use an Open Source Model to Chat with a Full Video


2 - Setup and Installation


Install Required Libraries

First, we will install the TwelveLabs and Chroma SDKs.

# Install Twelve Labs and Chroma libraries
!pip install --upgrade twelvelabs
!pip install --upgrade chromadb

Next, we will install the libraries to be used to run our open source model.

# Install libraries for use with the open source model
!pip install protobuf==3.20.3
!pip install --upgrade -q accelerate bitsandbytes
!pip install git+https://github.com/huggingface/transformers.git
!pip install av

If you are running this tutorial outside Colab, uncomment the cell below to install the libraries used to view the video data we are working with.

# Extra Things to install if you're not on colab
# !python -m pip install pillow
# !python -m pip install sentencepiece
# !python -m pip install matplotlib


3 - Preparing the Video Data

Next, we will prepare our video data.


Using our Video Data

This demo uses video data from a TwelveLabs Google Drive folder. To use it, you'll need to link the folder to your Google Drive, and then mount your Google Drive to this Colab.


Linking the folder to our Google Drive

Anyone can access the folder with this link: https://drive.google.com/drive/folders/1k6FmkVglFsdtJG4MTIK-2dk1Dk9gTPtu?usp=share_link

To link this to the correct spot in your Google Drive:

  1. Go to "Shared with me" in Google Drive.

  2. Locate the shared folder you want to access.

  3. Select "Organize" -> "Add Shortcut"

  4. Choose "My Drive" as the destination and click "Add".

Now this folder should be accessible at /content/drive/MyDrive/TwelveLabs-Chroma


Mounting Drive

Here we mount your drive to this Colab.

from google.colab import drive
drive.mount('/content/drive')
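
As a quick optional sanity check (assuming you added the shortcut exactly as described above), you can list the shared folder after mounting to confirm it is visible:

import os

# Optional check: the shared folder should now be visible in your Drive
print(os.listdir('/content/drive/MyDrive/TwelveLabs-Chroma'))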


Set Video Path

Then we set the path for the videos we will be working with. This may change based on where you link the video folder.

video_folder_path = "/content/drive/MyDrive/TwelveLabs-Chroma/videos/"


Upscale Video Resolution

Some of our videos are too low resolution to use in the embedding engine, so we will upscale them to 854x480 with upscale_video.

import numpy as np
import subprocess
import os

def upscale_video(input_file, output_path, target_width=854, target_height=480):
    """
    Upscale a video to the target width and height using FFmpeg.

    Args:
        input_file (str): Path to the input video file.
        output_path (str): Directory in which to save the upscaled video.
        target_width (int): Desired output width. Default is 854.
        target_height (int): Desired output height. Default is 480.
    """
    output_file = os.path.join(output_path, os.path.basename(input_file))

    # Skip videos that have already been upscaled
    if os.path.exists(output_file):
        print(f"Skipping {input_file} as {output_file} already exists.")
        return

    # FFmpeg command to upscale the video
    ffmpeg_command = [
        'ffmpeg',
        '-y',                                            # Overwrite the output file if it exists
        '-i', input_file,                                # Input file
        '-vf', f'scale={target_width}:{target_height}',  # Scale filter with target dimensions
        '-c:a', 'copy',                                  # Copy audio stream without re-encoding
        output_file                                      # Output file
    ]

    # Run the FFmpeg command
    subprocess.run(ffmpeg_command)

    print(f"Upscaled video saved to {output_file}")

First, we create the path where we will save the upscaled videos.

upscaled_video_dir = video_folder_path + "upscaled_videos/"

Then, we will upscale the videos. This will skip the videos we have already upscaled and placed in our upscaled video folder.

# Upscale all .mp4 videos
# Create output directory if it doesn't exist
if not os.path.exists(upscaled_video_dir):
    os.makedirs(upscaled_video_dir)

# Iterate over all files in the raw video directory
for filename in os.listdir(video_folder_path):
    # Check if the file is a video file
    input_filepath = os.path.join(video_folder_path, filename)
    if filename.endswith(".mp4"):
        upscale_video(input_filepath, upscaled_video_dir)
# Output
...
Upscaled video saved to ../videos/upscaled_videos/How To Make Birria Tacos [4nIFJFgH99w].mp4
...
Upscaled video saved to ../videos/upscaled_videos/How To Make a McDonald's Cheeseburger [SvOx7tA_Cv8].mp4
...
Upscaled video saved to ../videos/upscaled_videos/How To Make Potato Wedges [eZXbMWPJkKQ].mp4


4 - Create Video Embeddings Using the TwelveLabs Marengo Engine

Here we will use the TwelveLabs Marengo Engine to create embeddings for our video.

We will start by loading our TwelveLabs API key. This can be stored in your Colab secrets store which can be found by clicking the “Secrets” (Key Icon) on the left panel of your Colab and then clicking “+ Add new secret.”

from google.colab import userdata
TL_API_KEY=userdata.get('TL_API_KEY')

Next, we initialize the TwelveLabs and Chroma clients. The Chroma database in this tutorial is stored locally.

from twelvelabs import TwelveLabs
from twelvelabs.models.embed import EmbeddingsTask

# Initialize the Twelve Labs client
twelvelabs_client = TwelveLabs(api_key=TL_API_KEY)

Here we use the Ephemeral Client, which is great for experimenting. It runs in memory, so data is not persisted after the application closes.

import chromadb

# Initialize Ephemeral Chroma Client
chroma_client = chromadb.Client()

Alternate Clients

If you need persistent data, you have two options:

1 - Persistent Client: This will save and load your database to your local disk.

# Persistent Client Option
chroma_client = chromadb.PersistentClient(path="/path/to/save/to")

2 - HTTP Client: This connects to a Chroma server running in a separate process, either self-deployed or on Chroma Cloud.

# Http Client
chroma_client = chromadb.HttpClient(host='localhost', port=8000)


Create Video Embeddings and Format for Chroma

Here we create video embeddings using Marengo and format them for Chroma. To upload data to Chroma, you need three parallel lists covering everything you want to upload: embeddings, metadatas, and ids.

def on_task_update(task: EmbeddingsTask):
    print(f"  Status={task.status}")

# Create video embeddings and format for Chroma
def create_video_embeddings(client,video_file,segment_length,task_id=None):

    #upload video to twelve labs if it does not already exist
    video_name = os.path.basename(video_file)

    if task_id is None or task_id == "":
        task = client.embed.task.create(
            engine_name="Marengo-retrieval-2.7",
            video_file=video_file,
            video_clip_length=segment_length
        )
        print(
            f"Created task: id={task.id} engine_name={task.engine_name} status={task.status}"
        )

        status = task.wait_for_done(
            sleep_interval=2,
            callback=on_task_update
        )

        print(f"Embedding done: {status}")

        task_id = task.id

    #fetch embeddings
    task = client.embed.task.retrieve(task_id)
    print("task",task)

    #format for chroma
    embeddings = []
    metadatas = []
    ids = []

    idx = 0

    print("embeddings",task.video_embeddings)

    if task.video_embeddings is not None:
        for v in task.video_embeddings:

            metadata = {
                "embedding_scope":v.embedding_scope,
                "start_offset_sec":v.start_offset_sec,
                "end_offset_sec":v.end_offset_sec,
                "video_file":video_file,
                "video_name":video_name,
                "task_id":task.id,
                "video_segment_number":idx
            }


            embedding = v.values
            id = task.id + "_" + str(idx)

            metadatas.append(metadata)
            embeddings.append(embedding)
            ids.append(id)

            idx += 1

    return (ids,metadatas,embeddings,task_id)

Next, we select the video we will be working with and set a 6-second embedding segment duration: each embedding will refer to a 6-second segment of the video.

# Set the segment duration and the video we will be working with
segment_duration = 6
current_video_path = upscaled_video_dir + "How To Make Birria Tacos [4nIFJFgH99w].mp4"

Then, we retrieve our embeddings data to be uploaded to Chroma. We also get our embeddings task_id from TwelveLabs. This task_id can be used to retrieve embeddings for this video in the future so we don’t have to run the embedding model multiple times for the same video.

# Get embeddings to upload to Chroma

# Set task_id if you already have one, otherwise set to empty string
task_id = ""
ids, metadatas, embeddings, task_id = create_video_embeddings(twelvelabs_client,current_video_path,segment_duration,task_id)

You can find the task_id in the output.
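
If you plan to reuse these embeddings later without re-running the embedding model, it can help to surface the task_id explicitly, for example:

# Print the task_id so it can be saved and reused later
print("Embeddings task_id:", task_id)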



5 - Store Video Embeddings in a Chroma Database

Now that we have our records and vectors in an easy format, we can simply add them to a new collection in Chroma (or an existing collection, if one already exists).

# Fetch or create a Chroma Collection
chroma_collection_name = "video_embeddings"
collection = chroma_client.get_or_create_collection(chroma_collection_name)

# Add embeddings and metadata to our collection
collection.add(
    metadatas = metadatas,
    embeddings = embeddings,
    ids=ids
)
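
As an optional sanity check, Chroma collections expose a count() method, so a snippet like this can confirm how many segment embeddings were stored:

# Optional: confirm how many segment embeddings are now stored
print(f"Collection '{chroma_collection_name}' holds {collection.count()} embeddings")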


6 - Query Embeddings in our Chroma Database to Find Relevant Video Segments


Testing the Vector Search

Now that we have everything in the collection, we can test that the embeddings query works. We will search using the first embedding, which should have a distance of zero from itself.

# Use first embedding as a test search
test_segment_embeddings = embeddings[0]

results = collection.query(
    query_embeddings=[test_segment_embeddings],
    n_results=4
)

print("search embeddings for:",ids[0])
print("found:", results["ids"][0][0])
print("distance:",results["distances"][0][0])

#assert that the first segment's embedding is distance 0 from itself
assert results["ids"][0][0] == ids[0]
assert results["distances"][0][0] == 0


Querying our Vector Database

Now we will show the power of TwelveLabs embeddings and the Chroma vector database for efficiently querying our videos.

The video we chose earlier is a cooking tutorial for birria tacos. We will use TwelveLabs and Chroma to figure out what ingredients we need to create birria tacos.

We set that query here.

query = "What are the ingredients for birria tacos?"

Next, we create a function that uses TwelveLabs Marengo to embed a text query and then queries our Chroma collection using that embedding.

import os

def query_chroma(collection,query,n_results=1):
    #Create embedding for query
    embedding = twelvelabs_client.embed.create(
        engine_name="Marengo-retrieval-2.7",
        text=query,
        text_truncate="start",
    )

    query_embeddings = embedding.text_embedding.float

    #Search Chroma database with query embedding

    response = collection.query(
        query_embeddings=query_embeddings,
        n_results=n_results,
    )


    return response

Then, we run our Chroma query to find the 6-second segment of our tutorial that best answers this query. Later in this tutorial we will pass this segment into TwelveLabs Pegasus and an open source model and ask questions, including which ingredients we need.

Our metadata has information on the segment and the video it came from.

response = query_chroma(collection,query)

# Print the properties and distance of the most similar object
print(response["ids"][0][0])
print(response["distances"][0][0])
print(response["metadatas"][0][0])

# Get the path for the found video segment for the next step
found_video_metadata = response["metadatas"][0][0]
# Output

672101d56025850d8c890d1c_6
1.3052971363067627
{'embedding_scope': 'clip', 
'end_offset_sec': 42.0, 
'start_offset_sec': 36.0, 
'task_id': '672101d56025850d8c890d1c', 
'video_file': '../videos/upscaled_videos/How To Make Birria Tacos [4nIFJFgH99w].mp4', 
'video_name': 'How To Make Birria Tacos [4nIFJFgH99w].mp4', 
'video_segment_number': 6}


7 - Splitting Videos into Segments

While we could upload the entire tutorial video to Pegasus, we will only upload the relevant 6 second segment we found in the previous step. This allows us to be efficient with compute.

Here we split the video we are working with into segments that match our embeddings segments and put them into the split videos folder.

split_video_dir = video_folder_path + "split_videos/"

def split_video(input_path, output_dir, segment_duration=6):

    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    filename = os.path.splitext(os.path.basename(input_path))[0]
    filetype = os.path.splitext(os.path.basename(input_path))[1]

    # Split video into segments
    ffmpeg_command = [
        'ffmpeg',
        '-i', input_path,             # Input video file
        '-c', 'copy',                  # Copy both video and audio codecs
        '-f', 'segment',               # Segment mode
        '-segment_time', str(segment_duration),  # Segment length
        '-reset_timestamps', '1',      # Reset timestamps for each segment
        output_dir + filename + '_%03d' + filetype  # Output filename pattern (e.g., output_001.mp4)
    ]

    # Run the command
    subprocess.run(ffmpeg_command)

    print("Video split into 6-second segments successfully.")

# Split the video into segments
split_video(input_path=current_video_path, output_dir=split_video_dir, segment_duration=segment_duration)
# Output

...
[segment @ 0x561c0a7b4800] Opening '../videos/split_videos/How To Make Birria Tacos [4nIFJFgH99w]_001.mp4' for writing
...
[segment @ 0x561c0a7b4800] Opening '../videos/split_videos/How To Make Birria Tacos [4nIFJFgH99w]


8 - Use TwelveLabs Pegasus to Chat with the Returned Video Segment

These next few cells will show us how simple it is to use Pegasus to chat with a video: everything comes ready out of the box.


Uploading Video Segment to Pegasus

First, we will create an index that uses the Pegasus engine for our video uploads, then upload the video segment to it.

# Create or retrieve pegasus index
engines = [
        {
            "name": "pegasus1.2",
            "options": ["visual", "conversation"]
        }
    ]

index_name = "cooking_video_index"
indices_list = twelvelabs_client.index.list(name=index_name)

if len(indices_list) == 0:
    index = twelvelabs_client.index.create(
        name=index_name,
        engines=engines,

    )
    print(f"A new index has been created: id={index.id} name={index.name} engines={index.engines}")
else:
    index = indices_list[0]
    print(f"Index already exists: id={index.id} name={index.name} engines={index.engines}")
# Output

A new index has been created: id=672120b06025850d8c890db2 
name=cooking_video_index 
engines=root=[Engine(name='pegasus1.2', options=['visual', 'conversation'])]


Get Video Segment File Name

Next, we will use the metadata from our Chroma query to find the segment that matched our query.

# Get video segment filename
found_video_segment_number = int(found_video_metadata["video_segment_number"])
found_video_file = found_video_metadata["video_file"]
found_video_filename = os.path.splitext(os.path.basename(found_video_file))[0]
found_video_filetype = os.path.splitext(os.path.basename(found_video_file))[1]
found_video_segment_filename = found_video_filename + f"_{found_video_segment_number:03d}"

found_video_segment_path = split_video_dir + found_video_segment_filename + found_video_filetype
print(found_video_segment_path)
# Output

../videos/split_videos/How To Make Birria Tacos [4nIFJFgH99w]_006.mp4


Upload Video to Pegasus and Get Video ID

We then create a function to upload our video segment to TwelveLabs Pegasus.

This function returns a video_id that we will use to chat with the video. We can store this video_id for multi-turn chat.

def upload_video_to_twelve_labs(index,video_path):

    #upload our video to our twelve labs index
    task = twelvelabs_client.task.create(
        index_id=index.id,
        file = video_path
    )
    print(f"Task created: id={task.id} status={task.status}")

    task.wait_for_done(sleep_interval=5, callback=on_task_update)

    if task.status != "ready":
      raise RuntimeError(f"Indexing failed with status {task.status}")
    print(f"The unique identifer of your video is {task.video_id}.")

    #return the video id
    return task.video_id

If we already have a video id for this video segment, we can store it here.

# Set video_id if you already have one, otherwise set to empty string
video_id = ""

# Upload video to get video id to chat with in Pegasus
if video_id == "":
    video_id = upload_video_to_twelve_labs(index,found_video_segment_path)

You can find the video_id in the output.



Calling Pegasus

Here we query the video segment with the query we used to find it. Because TwelveLabs handles all of the boilerplate behind the scenes, we can call our model with a simple function.
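
A minimal sketch of that call, mirroring the generate.text usage we use later in this tutorial for the full video, looks like this (we store the response in segment_answer so it can be compared with the open source model below):

# Ask Pegasus the same question we used to retrieve the segment
res = twelvelabs_client.generate.text(
  video_id=video_id,  # video_id returned when we uploaded the segment
  prompt=query
)
segment_answer = res.data

print(f"query {query}")
print(segment_answer)

Because the video_id persists in the index, you can keep calling generate.text with follow-up prompts against the same segment for multi-turn style chat.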


9 - Use an Open Source Model to Chat with the Returned Video Segment

Now we will use an open source model to chat with our video segment and compare it to Pegasus.

First, we need to sample the videos ourselves for the model to consume. We'll modify the LLaVA-NeXT-Video sampling code to get a uniform sample of 8 frames for each video.

And we can do this for all of the video segments in our folder.

read_video_pyav comes directly from the LLaVA-NeXT-Video Colab notebook; it decodes the selected frames into the NumPy representation the model expects for inference.

import av
def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.

    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.

    Returns:
        np.ndarray: np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

def sample_video(video_path, num_samples=8):
    container = av.open(video_path)

    # sample uniformly num_samples frames from the video
    total_frames = container.streams.video[0].frames
    indices = np.arange(0, total_frames, total_frames / num_samples).astype(int)

    sampled_frames = read_video_pyav(container, indices)

    return sampled_frames

def process_videos_in_folder(folder_path):
    sample_info = {}

    # Supported video file extensions
    video_extensions = ('.mp4', '.avi', '.mov', '.mkv')

    for filename in os.listdir(folder_path):
        simple_video_name = os.path.splitext(os.path.basename(filename))[0]
        if filename.lower().endswith(video_extensions):
            video_path = os.path.join(folder_path, filename)
            try:
                sampled_clip = sample_video(video_path)
                sample_info[simple_video_name] = {"sampled_video": sampled_clip, "video_path" : video_path}
            except Exception as e:
                print(f"Error processing {filename}: {str(e)}")

    return sample_info

sampled_video_info = process_videos_in_folder(split_video_dir)

# Get video segment found in our Chroma query
video_segment = sampled_video_info[found_video_segment_filename]['sampled_video']


Setting up our Model

We'll load our model with 4-bit quantization to reduce memory usage and speed up inference.

from transformers import BitsAndBytesConfig, LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    "llava-hf/LLaVA-NeXT-Video-7B-hf",
    quantization_config=quantization_config,
    device_map='auto'
)

# To use later to play the videos in the notebook itself

from matplotlib import pyplot as plt
from matplotlib import animation
from IPython.display import HTML

# np array with shape (frames, height, width, channels)
# preview one of the sampled segments
video = sampled_video_info[list(sampled_video_info.keys())[0]]['sampled_video']

fig = plt.figure()
im = plt.imshow(video[0,:,:,:])

plt.close() # this is required to not display the generated image

def init():
    im.set_data(video[0,:,:,:])

def animate(i):
    im.set_data(video[i,:,:,:])
    return im

anim = animation.FuncAnimation(fig, animate, init_func=init, frames=video.shape[0],
                               interval=100)
HTML(anim.to_html5_video())


Running the Model

Now that we have our query and the relevant video, we can feed them into the model to get an output.

# Each "content" is a list of dicts and you can add image/video/text modalities
conversation = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": query},
              {"type": "video"},
              ],
      },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
prompt_len = len(prompt)

inputs = processor([prompt], videos=[video_segment], padding=True, return_tensors="pt").to(model.device)
generate_kwargs = {"max_new_tokens": 100, "do_sample": True, "top_p": 0.9}

output = model.generate(**inputs, **generate_kwargs)
open_source_segment_generated_text = processor.batch_decode(output, skip_special_tokens=True)

print(open_source_segment_generated_text[0])


10 - Compare Pegasus to the Open Source model

As we can see, Pegasus does a better job of answering our query (printed again below for convenience).

print(f"query {query}")
print("pegasus answer")
print(segment_answer)
print("open source answer")
print(open_source_segment_generated_text[0])


11 - Using Chroma and TwelveLabs Embeddings to Search Multiple Videos

We have already shown how TwelveLabs Marengo and Chroma can be used to find relevant segments within a video.

Now we can show the true power of Marengo and Chroma for a RAG use case — by querying a large set of potential videos to find the specific video that we need to answer the query.


Embedding our Video Database

First we will embed all of our videos and store those embeddings in Chroma:

# Embed and store task IDs for all videos
chroma_collection_name = "video_embeddings"
collection = chroma_client.get_or_create_collection(chroma_collection_name)

Then, we create a dictionary to store task_ids for each video in case we need to retrieve these embeddings from TwelveLabs in the future.

# Store twelve labs task ids for each video
task_ids = {}

Next, we get the embeddings for each video using the create_video_embeddings function we created earlier and upload them to our Chroma collection.

# Get embeddings and metadata for each video
# Store task ids so we don't upload videos multiple times
for filename in os.listdir(upscaled_video_dir):

    if filename.endswith(".mp4"):

        if (filename in task_ids.keys()):
            task_id = task_ids[filename]
        else:
            task_id = None

        file_path = os.path.join(upscaled_video_dir, filename)

        ids, metadatas, embeddings, task_id = create_video_embeddings(twelvelabs_client,file_path,segment_duration,task_id)

        task_ids[filename] = task_id

        collection.add(
            metadatas = metadatas,
            embeddings = embeddings,
            ids=ids
        )

print(task_ids)
# Output

{'How To Make Birria Tacos [4nIFJFgH99w].mp4': '67211a5430c547875065b3f7', 
"How To Make a McDonald's Cheeseburger [SvOx7tA_Cv8].mp4": '67211ab56025850d8c890d93', 
'How To Make Potato Wedges [eZXbMWPJkKQ]


Querying our Database

Here we use the same query as before to find the full video that will answer our query.

response = query_chroma(collection,query)
found_full_video_name = response["metadatas"][0][0]["video_name"]
print(found_full_video_name)
# Output

How To Make Birria Tacos [4nIFJFgH99w].mp4


12 - Use Pegasus to Chat with a Full Video

We already have an index created, so we just need to upload the videos to this index and then call Pegasus.

# Store pegasus video ids so that we don't upload videos multiple times
pegasus_video_ids = {}

We upload each video here and store the video ids we will use to chat with the video.

for upscaled_video in os.listdir(upscaled_video_dir):
    upscaled_video_path = os.path.join(upscaled_video_dir, upscaled_video)
    print(upscaled_video_path)
    if upscaled_video not in pegasus_video_ids:
        video_id = upload_video_to_twelve_labs(index,upscaled_video_path)
        pegasus_video_ids[upscaled_video] = video_id
# Output

../videos/upscaled_videos/How To Make Birria Tacos [4nIFJFgH99w].mp4
Task created: id=67211b6b30c547875065b407 status=validating
  Status=pending
...
  Status=indexing
...
  Status=ready
The unique identifier of your video is 67211b6c1a5f2afde113615e.
../videos/upscaled_videos/How To Make a McDonald's Cheeseburger [SvOx7tA_Cv8].mp4
Task created: id=67211c446025850d8c890da5 status=validating
  Status=pending
...
  Status=indexing
...
  Status=ready
The unique identifier of your video is 67211c451a5f2afde113615f.
../videos/upscaled_videos/How To Make Potato Wedges [eZXbMWPJkKQ].mp4
...

print(pegasus_video_ids)
# Output

{'How To Make Birria Tacos [4nIFJFgH99w].mp4': '67211b6c1a5f2afde113615e', 
"How To Make a McDonald's Cheeseburger [SvOx7tA_Cv8].mp4": '67211c451a5f2afde113615f', 
'How To Make Potato Wedges [eZXbMWPJkKQ]


Calling Pegasus to Chat with Full Video

First, we find the video_id associated with the video that matched our Chroma query.

video_id = pegasus_video_ids[found_full_video_name]
print(video_id)

Next, we ask Pegasus what the ingredients are for birria tacos. It is exactly the same query we used before, but this time we are chatting with the entire video instead of just a 6-second segment.

res = twelvelabs_client.generate.text(
  video_id=video_id,
  prompt=query
)
full_video_answer = res.data
print(f"query {query}")
print(f"{full_video_answer}")


Compare full video answer to segment answer

We can then compare this to the answer given when we chatted with the 6-second segment.

print(f"segment answer: \n{segment_answer}")


13 - Use an Open Source Model to Chat with a Full Video

Now we will compare Pegasus to the open source model when chatting with a full video.

After we sample all of the full videos, we can run our model on the video that matched our query and compare its answer.

# Sample all of the full videos
sampled_database_video_info = process_videos_in_folder(upscaled_video_dir)

# Get the sampled frames for the full video found in our Chroma query
full_video_sample = sampled_database_video_info[os.path.splitext(found_full_video_name)[0]]['sampled_video']

# Each "content" is a list of dicts and you can add image/video/text modalities
conversation = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": query},
              {"type": "video"},
              ],
      },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
prompt_len = len(prompt)

inputs = processor([prompt], videos=[full_video_sample], padding=True, return_tensors="pt").to(model.device)
generate_kwargs = {"max_new_tokens": 100, "do_sample": True, "top_p": 0.9}

output = model.generate(**inputs, **generate_kwargs)
generated_text = processor.batch_decode(output, skip_special_tokens=True)

print(generated_text[0])


Compare Result to Pegasus Answer

As we can see, the open source model cannot give us an answer when chatting with the entire video either, showing the power of Pegasus:

print(f"Pegasus answer: \n{full_video_answer}")


14 - Comparison

Comparing our two models, it is clear that Pegasus understands our video and is able to give us an accurate response to our query. It recognizes that this is a tutorial for birria tacos and is able to give us the ingredient list, both for the entire video and for individual video segments.

The LLaVA-NeXT-Video open-source model understands our query, but is not able to extract the relevant information from the video. LLaVA-NeXT-Video does have some understanding that this is a cooking video, but lacks the richness of understanding needed to answer our query.


15 - Conclusion

This guide demonstrated how to interact with video content, either individually or as a complete set, using the TwelveLabs Embed API and Chroma’s vector database (running locally on your machine) to manage retrieval.

We also compared TwelveLabs' Pegasus model with the LLaVA-NeXT-Video open-source model, evaluating the required infrastructure, developer experience, and query results. Compared to the open-source model, Pegasus showed promise with less operational overhead, better instruction following, and larger context lengths for easier querying of longer videos.


Appendix

For your reference and further exploration:

  1. Complete Colab Notebook

  2. TwelveLabs Documentation

  3. Chroma’s Client and Documentation

Big thanks to Jeff Huber and Itai Smith from the Chroma team for reviewing the draft!


Welcome to this tutorial on integrating TwelveLabs' Embed API with Chroma’s Vector Database for RAG-based Q&A on videos. In this guide, you'll learn how to extract text answers from an unstructured database of videos using generative models.

We combine TwelveLabs' rich, contextual embeddings with Chroma’s vector database to store, index, and query these video embeddings, creating a chat application. This notebook demonstrates the current possibilities of these technologies with just a few lines of code.

For comparison, we also showcase the difference in developer experience between using TwelveLabs' Generate API to generate text responses and a leading open-source model, LLaVA-NeXT-Video.


1 - Overview

In this tutorial we will go through several examples where we use TwelveLabs Marengo to create video embeddings, use Chroma to store and query those embeddings to find relevant videos, and use TwelveLabs Pegasus and LLaVA-NeXT-Video to chat with the returned videos.

We will:

  1. Create Video Embeddings Using the TwelveLabs Marengo Engine

  2. Store Video Embeddings in a Chroma Database

  3. Query Embeddings in our Chroma Database to Find Relevant Video Segments

  4. Use TwelveLabs Pegasus to Chat with the Returned Video Segment

  5. Use an Open Source Model to Chat with the Returned Video Segment

  6. Compare Pegasus to the Open Source model

  7. Use Chroma and TwelveLabs Embeddings to Search Multiple Videos

  8. Use Pegasus to Chat with a Full Video

  9. Use an Open Source Model to Chat with a Full Video


2 - Setup and Installation


Install Required Libraries

First, we will install the TwelveLabs and Chroma SDKs.

# Install Twelve Labs and Chroma libraries
!pip install --upgrade twelvelabs
!pip install --upgrade chromadb

Next, we will install the libraries to be used to run our open source model.

# Install libraries for use with the open source model
!pip install protobuf==3.20.3
!pip install --upgrade -q accelerate bitsandbytes
!pip install git+https://github.com/huggingface/transformers.git
!pip install av

If you are running this tutorial outside Colab, uncomment the cell below to install the libraries used to view the video data we are working with.

# Extra Things to install if you're not on colab
# !python -m pip install pillow
# !python -m pip install sentencepiece
# !python -m pip install matplotlib


3 - Preparing the Video Data

Next, we will prepare our video data.


Using our video Data

This demo uses video data from a TwelveLabs Google Drive folder. To use it, you'll need to link the folder to your Google Drive, and then mount your Google Drive to this Colab.


Linking the folder to our Google Drive

Anyone can access the folder with this link: https://drive.google.com/drive/folders/1k6FmkVglFsdtJG4MTIK-2dk1Dk9gTPtu?usp=share_link

To link this to the correct spot in your Google Drive:

  1. Go to "Shared with me" in Google Drive.

  2. Locate the shared folder you want to access.

  3. Select "Organize" -> "Add Shortcut"

  4. Choose "My Drive" as the destination and click "Add".

Now this folder should be accessible at /content/drive/MyDrive/TwelveLabs-Chroma


Mounting Drive

Here we mount your drive to this Colab.

from google.colab import drive
drive.mount('/content/drive')


Set Video Path

Then we set the path for the videos we will be working with. This may change based on where you link the video folder.

video_folder_path = "/content/drive/MyDrive/TwelveLabs-Chroma/videos/"


Upscale Video Resolution

Some of our videos are too low resolution to use in the embedding engine, so we will double their resolution with upscale_video.

import numpy as np
import subprocess
import os

def upscale_video(input_file, output_path, target_width=854, target_height=480):

    output_file = os.path.join(output_path, os.path.basename(input_file))

    if os.path.exists(output_file):
        print(f"Skipping {input_file} as {output_file} already exists.")
        return
 

    """
    Upscale a video to the target width and height using FFmpeg.

    Args:
        input_file (str): Path to the input video file.
        output_file (str): Path to save the upscaled video.
        target_width (int): Desired output width. Default is 854.
        target_height (int): Desired output height. Default is 480.
    """
    # FFmpeg command to upscale the video
    ffmpeg_command = [
        'ffmpeg',
        '-i', input_file,                              # Input file
        '-vf', f'scale={target_width}:{target_height}', # Scale filter with target dimensions
        '-c:a', 'copy',                                # Copy audio stream without re-encoding
        output_file,                                    # Output file
        "-y"
    ]

    # Run the FFmpeg command
    subprocess.run(ffmpeg_command)

    print(f"Upscaled video saved to {output_file}")

First, we create the path where we will save the upscaled videos.

upscaled_video_dir = video_folder_path + "upscaled_videos/"

Then, we will upscale the videos. This will skip the videos we have already upscaled and placed in our upscaled video folder.

# Upscale all .mp4 videos
# Create output directory if it doesn't exist
if not os.path.exists(upscaled_video_dir):
    os.makedirs(upscaled_video_dir)

# Iterate over all files in the raw video directory
for filename in os.listdir(video_folder_path):
    # Check if the file is a video file
    input_filepath = os.path.join(video_folder_path, filename)
    if filename.endswith(".mp4"):
        upscale_video(input_filepath, upscaled_video_dir)
# Output
...
Upscaled video saved to ../videos/upscaled_videos/How To Make Birria Tacos [4nIFJFgH99w].mp4
...
Upscaled video saved to ../videos/upscaled_videos/How To Make a McDonald's Cheeseburger [SvOx7tA_Cv8].mp4
...
Upscaled video saved to ../videos/upscaled_videos/How To Make Potato Wedges [eZXbMWPJkKQ]


4 - Create Video Embeddings Using the TwelveLabs Marengo Engine

Here we will use the TwelveLabs Marengo Engine to create embeddings for our video.

We will start by loading our TwelveLabs API key. This can be stored in your Colab secrets store which can be found by clicking the “Secrets” (Key Icon) on the left panel of your Colab and then clicking “+ Add new secret.”

from google.colab import userdata
TL_API_KEY=userdata.get('TL_API_KEY')

Next we initiate the TwelveLabs and Chroma clients. The Chroma database in this tutorial is stored locally.

from twelvelabs import TwelveLabs
from twelvelabs.models.embed import EmbeddingsTask

# Initialize the Twelve Labs client
twelvelabs_client = TwelveLabs(api_key=TL_API_KEY)

Here we use the Ephemeral Client, which is great for experimenting. It is run in memory, but data is not persisted after the application closes.

import chromadb

# Initialize Ephemeral Chroma Client
chroma_client = chromadb.Client()

Alternate Clients

If you need persistent data, you have two options:

1 - Persistent Client: This will save and load your database to your local disk.

# Persistent Client Option
chroma_client = chromadb.PersistentClient(path="/path/to/save/to")

2 - HTTP Client: This will connect to a Chroma server running in a separate process - either self deployed, or on Chroma Cloud.

# Http Client
chroma_client = chromadb.HttpClient(host='localhost', port=8000)


Create Video Embeddings and Format for Chroma

Here we create video embeddings using Marengo and format for Chroma. To upload data to Chroma you need three separate lists for all the data that you want to upload: embeddings, meta-datas, and ids.

def on_task_update(task: EmbeddingsTask):
    print(f"  Status={task.status}")

# Create video embeddings and format for Chroma
def create_video_embeddings(client,video_file,segment_length,task_id=None):

    #upload video to twelve labs if it does not already exist
    video_name = os.path.basename(video_file)

    if task_id == None or task_id == "":
        task = client.embed.task.create(
            engine_name="Marengo-retrieval-2.7",
            video_file=video_file,
            video_clip_length=segment_length
        )
        print(
            f"Created task: id={task.id} engine_name={task.engine_name} status={task.status}"
        )

        status = task.wait_for_done(
            sleep_interval=2,
            callback=on_task_update
        )

        print(f"Embedding done: {status}")

        task_id = task.id

    #fetch embeddings
    task = client.embed.task.retrieve(task_id)
    print("task",task)

    #format for chroma
    embeddings = []
    metadatas = []
    ids = []

    idx = 0

    print("embeddings",task.video_embeddings)

    if task.video_embeddings is not None:
        for v in task.video_embeddings:

            metadata = {
                "embedding_scope":v.embedding_scope,
                "start_offset_sec":v.start_offset_sec,
                "end_offset_sec":v.end_offset_sec,
                "video_file":video_file,
                "video_name":video_name,
                "task_id":task.id,
                "video_segment_number":idx
            }


            embedding = v.values
            id = task.id + "_" + str(idx)

            metadatas.append(metadata)
            embeddings.append(embedding)
            ids.append(id)

            idx += 1

    return (ids,metadatas,embeddings,task_id)

Next, we select the video we will be working with and set a 6 second embedding segment duration - each embedding will refer to a 6 second segment in the video.

# Set the segment duration and the video we will be working with
segment_duration = 6
current_video_path = upscaled_video_dir + "How To Make Birria Tacos [4nIFJFgH99w].mp4"

Then, we retrieve our embeddings data to be uploaded to Chroma. We also get our embeddings task_id from TwelveLabs. This task _id can be used to retrieve embeddings for this video in the future so we don’t have to run the embedding model multiple times for the same video.

# Get embeddings to upload to Chroma

# Set task_id if you already have one, otherwise set to empty string
task_id = ""
ids, metadatas, embeddings, task_id = create_video_embeddings(twelvelabs_client,current_video_path,segment_duration,task_id)

You can find the task_id in the output.



5 - Store Video Embeddings in a Chroma Database

Now that we have our records and vectors in an easy format, we can simply add them to a new collection in Chroma (or, an existing collection if it already exists).

# Fetch or create a Chroma Collection
chroma_collection_name = "video_embeddings"
collection = chroma_client.get_or_create_collection(chroma_collection_name)

# Add embeddings and metadata to our collection
collection.add(
    metadatas = metadatas,
    embeddings = embeddings,
    ids=ids
)


6 - Query Embeddings in our Chroma Database to Find Relevant Video Segments


Testing the Vector Search

Now that we have everything in the collection, we can test and see that the embeddings query works. We will search by the first returned embedding, which should have a distance of zero to itself.

# Use first embedding as a test search
test_segment_embeddings = embeddings[0]

results = collection.query(
    query_embeddings=[test_segment_embeddings],
    n_results=4
)

print("search embeddings for:",ids[0])
print("found:", results["ids"][0][0])
print("distance:",results["distances"][0][0])

#assert that the first video's text embedding is distance 0 from itself
assert results["ids"][0][0] == ids[0]
assert results["distances"][0][0] == 0


Querying our Vector Database

Now we will show the power of TwelveLabs embedding and the Chroma vector database to efficiently query a large number of videos.

The video we chose earlier is a cooking tutorial for birria tacos. We will use TwelveLabs and Chroma to figure out what ingredients we need to create birria tacos.

We set that query here.

query = "What are the ingredients for birria tacos?"

Next, we create a function that uses TwelveLabs Marengo to embed a text query and then query our chroma collection using that embedding.

import os

def query_chroma(collection,query,n_results=1):
    #Create embedding for query
    embedding = twelvelabs_client.embed.create(
        engine_name="Marengo-retrieval-2.7",
        text=query,
        text_truncate="start",
    )

    query_embeddings = embedding.text_embedding.float

    #Search Chroma database with query embedding

    response = collection.query(
        query_embeddings=query_embeddings,
        n_results=n_results,
    )


    return response

Then, we run our Chroma query to find the 6 second segment of our tutorial that best answers this query. Later in this tutorial we will pass this segment into TwelveLabs Pegasus and an open source model to ask it questions including which ingredients we need.

Our metadata has information on the segment and the video it came from.

response = query_chroma(collection,query)

# Print the properties and distance of the most similar object
print(response["ids"][0][0])
print(response["distances"][0][0])
print(response["metadatas"][0][0])

# Get the path for the found video segment for the next step
found_video_metadata = response["metadatas"][0][0]
# Output

672101d56025850d8c890d1c_6
1.3052971363067627
{'embedding_scope': 'clip', 
'end_offset_sec': 42.0, 
'start_offset_sec': 36.0, 
'task_id': '672101d56025850d8c890d1c', 
'video_file': '../videos/upscaled_videos/How To Make Birria Tacos [4nIFJFgH99w].mp4', 
'video_name': 'How To Make Birria Tacos [4nIFJFgH99w]


7 - Splitting Videos into Segments

While we could upload the entire tutorial video to Pegasus, we will only upload the relevant 6 second segment we found in the previous step. This allows us to be efficient with compute.

Here we split the video we are working with into segments that match our embeddings segments and put them into the split videos folder.

split_video_dir = video_folder_path + "split_videos/"

def split_video(input_path, output_dir, segment_duration=6):

    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    filename = os.path.splitext(os.path.basename(input_path))[0]
    filetype = os.path.splitext(os.path.basename(input_path))[1]

    # Split video into segments
    ffmpeg_command = [
        'ffmpeg',
        '-i', input_path,             # Input video file
        '-c', 'copy',                  # Copy both video and audio codecs
        '-f', 'segment',               # Segment mode
        '-segment_time', str(segment_duration),  # Segment length
        '-reset_timestamps', '1',      # Reset timestamps for each segment
        output_dir + filename + '_%03d' + filetype  # Output filename pattern (e.g., output_001.mp4)
    ]

    # Run the command
    subprocess.run(ffmpeg_command)

    print("Video split into 6-second segments successfully.")

# Split the video into segments
split_video(input_path=current_video_path, output_dir=split_video_dir, segment_duration=segment_duration)
# Output

...
[segment @ 0x561c0a7b4800] Opening '../videos/split_videos/How To Make Birria Tacos [4nIFJFgH99w]_001.mp4' for writing
...
[segment @ 0x561c0a7b4800] Opening '../videos/split_videos/How To Make Birria Tacos [4nIFJFgH99w]


8 - Use TwelveLabs Pegasus to Chat with the Returned Video Segment

These next few cells will show us how simple it is to use Pegasus to chat with a video -- Everything comes ready out of the box.


Uploading Video Segment to Pegasus

First, we will create an index for our video uploads and the Pegasus Engine, then upload them.

# Create or retrieve pegasus index
engines = [
        {
            "name": "pegasus1.2",
            "options": ["visual", "conversation"]
        }
    ]

index_name = "cooking_video_index"
indices_list = twelvelabs_client.index.list(name=index_name)

if len(indices_list) == 0:
    index = twelvelabs_client.index.create(
        name=index_name,
        engines=engines,

    )
    print(f"A new index has been created: id={index.id} name={index.name} engines={index.engines}")
else:
    index = indices_list[0]
    print(f"Index already exists: id={index.id} name={index.name} engines={index.engines}")
# Output

A new index has been created: id=672120b06025850d8c890db2 
name=cooking_video_index 
engines=root=[Engine(name='pegasus1.2', options=['visual', 'conversation']


Get Video Segment File Name

Next, we will use the metadata from our Chroma query to find the segment that matched our query.

# Get video segment filename
found_video_segment_number = int(found_video_metadata["video_segment_number"])
found_video_file = found_video_metadata["video_file"]
found_video_filename = os.path.splitext(os.path.basename(found_video_file))[0]
found_video_filetype = os.path.splitext(os.path.basename(found_video_file))[1]
found_video_segment_filename = found_video_filename + f"_{found_video_segment_number:03d}"

found_video_segment_path = split_video_dir + found_video_segment_filename + found_video_filetype
print(found_video_segment_path)
# Output

../videos/split_videos/How To Make Birria Tacos [4nIFJFgH99w]


Upload Video to Pegasus and Get Video ID

We then create a function to upload our video segment to TwelveLabs Pegasus.

This function returns a video_id that we will use to chat with the video. We can store this video_id for multi-turn chat.

def upload_video_to_twelve_labs(index,video_path):

    #upload our video to our twelve labs index
    task = twelvelabs_client.task.create(
        index_id=index.id,
        file = video_path
    )
    print(f"Task created: id={task.id} status={task.status}")

    task.wait_for_done(sleep_interval=5, callback=on_task_update)

    if task.status != "ready":
      raise RuntimeError(f"Indexing failed with status {task.status}")
    print(f"The unique identifer of your video is {task.video_id}.")

    #return the video id
    return task.video_id

If we already have a video id for this video segment, we can store it here.

# Set video_id if you already have one, otherwise set to empty string
video_id = ""

# Upload video to get video id to chat with in Pegasus
if video_id == "":
    video_id = upload_video_to_twelve_labs(index,found_video_segment_path)

You can find the video_id in the output.



Calling Pegasus

Here we query the video segment with the query we used to find it. Because TwelveLabs handles all of the boilerplate behind the scenes, we can call our model with a simple function.


9 - Use an Open Source Model to Chat with the Returned Video Segment

Now we will use an open source model to chat with our video segment and compare it to Pegasus.

First, we need to sample the videos ourselves for the model to consume. We'll modify the LLaVa-NeXT-Video Sampling code to get a uniform sample of 8 frames for each video.

And we can do this for all of the video segments in our folder.

read_video_pyav comes directly from the LLaVa-NeXT-Video Colab notebook and it formats videos in the correct numpy representation for inference.

import av
def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.

    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.

    Returns:
        np.ndarray: np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

def sample_video(video_path, num_samples=8):
    container = av.open(video_path)

    # sample uniformly num_samples frames from the video
    total_frames = container.streams.video[0].frames
    indices = np.arange(0, total_frames, total_frames / num_samples).astype(int)

    sampled_frames = read_video_pyav(container, indices)

    return sampled_frames

def process_videos_in_folder(folder_path):
    sample_info = {}

    # Supported video file extensions
    video_extensions = ('.mp4', '.avi', '.mov', '.mkv')

    for filename in os.listdir(folder_path):
        simple_video_name = os.path.splitext(os.path.basename(filename))[0]
        if filename.lower().endswith(video_extensions):
            video_path = os.path.join(folder_path, filename)
            try:
                sampled_clip = sample_video(video_path)
                sample_info[simple_video_name] = {"sampled_video": sampled_clip, "video_path" : video_path}
            except Exception as e:
                print(f"Error processing {filename}: {str(e)}")

    return sample_info

sampled_video_info = process_videos_in_folder(split_video_dir)

# Get video segment found in our Chroma query
video_segment = sampled_video_info[found_video_segment_filename]['sampled_video']


Setting up our Model

We'll set up our model in 4-bit quantization to speed up inference.

from transformers import BitsAndBytesConfig, LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    "llava-hf/LLaVA-NeXT-Video-7B-hf",
    quantization_config=quantization_config,
    device_map='auto'
)

# To use later to play the videos in the notebook itself

from matplotlib import pyplot as plt
from matplotlib import animation
from IPython.display import HTML

# np array with shape (frames, height, width, channels)
# pick one at random just to see
video = sampled_video_info[list(sampled_video_info.keys())[0]]['sampled_video']

fig = plt.figure()
im = plt.imshow(video[0,:,:,:])

plt.close() # this is required to not display the generated image

def init():
    im.set_data(video[0,:,:,:])

def animate(i):
    im.set_data(video[i,:,:,:])
    return im

anim = animation.FuncAnimation(fig, animate, init_func=init, frames=video.shape[0],
                               interval=100)
HTML(anim.to_html5_video())


Running the Model

Now that we have our query and the relevant video, we can feed them into the model to get an output.

# Each "content" is a list of dicts and you can add image/video/text modalities
conversation = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": query},
              {"type": "video"},
              ],
      },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
prompt_len = len(prompt)

inputs = processor([prompt], videos=[video_segment], padding=True, return_tensors="pt").to(model.device)
generate_kwargs = {"max_new_tokens": 100, "do_sample": True, "top_p": 0.9}

output = model.generate(**inputs, **generate_kwargs)
open_source_segment_generated_text = processor.batch_decode(output, skip_special_tokens=True)

print(open_source_segment_generated_text[0])


10 - Compare Pegasus to the Open Source model

As we can see Pegasus does a better job at answering our query (printed out again for convenience).

print(f"query {query}")
print("pegasus answer")
print(segment_answer)
print("open source answer")
print(open_source_segment_generated_text[0])


11 - Using Chroma and TwelveLabs Embeddings to Search Multiple Videos

We have already shown how we can use TwelveLabs Marengo and Chroma can be used to find relevant segments within a video.

Now we can show the true power of Marengo and Chroma for a RAG use case — by querying a large set of potential videos to find the specific video that we need to answer the query.


Embedding our Video Database:

First we will embed all of our videos and store those embeddings in Chroma:

# Embed and store task ID's for all videos
chroma_collection_name = "video_embeddings"
collection = chroma_client.get_or_create_collection(chroma_collection_name)

Then, we create a dictionary to store task_ids for each video in case we need to retrieve these embeddings from TwelveLabs in the future.

# Store twelve labs task ids for each video
task_ids = {}

Next, we get the embeddings for each video using the create_video_embeddings function we created earlier and upload them to our Chroma collection.

# Get embeddings and metadata for each video
# Store task ids so we don't upload videos multiple times
for filename in os.listdir(upscaled_video_dir):

    if filename.endswith(".mp4"):

        if (filename in task_ids.keys()):
            task_id = task_ids[filename]
        else:
            task_id = None

        file_path = os.path.join(upscaled_video_dir, filename)

        ids, metadatas, embeddings, task_id = create_video_embeddings(twelvelabs_client,file_path,segment_duration,task_id)

        task_ids[filename] = task_id

        collection.add(
            metadatas = metadatas,
            embeddings = embeddings,
            ids=ids
        )

print(task_ids)
# Output

{'How To Make Birria Tacos [4nIFJFgH99w].mp4': '67211a5430c547875065b3f7', 
"How To Make a McDonald's Cheeseburger [SvOx7tA_Cv8].mp4": '67211ab56025850d8c890d93', 
'How To Make Potato Wedges [eZXbMWPJkKQ]


Querying our Database

Here we use the same query as before to find the full video that will answer our query.

response = query_chroma(collection,query)
found_full_video_name = response["metadatas"][0][0]["video_name"]
print(found_full_video_name)
# Output

How To Make Birria Tacos [4nIFJFgH99w]


12 - Use Pegasus to Chat with a Full Video

We already have an index created, so we just need to upload the videos to this index and then call Pegasus.

# Store pegasus video ids so that we don't upload videos multiple times
pegasus_video_ids = {}

We upload each video here and store the video IDs that we will use to chat with them.

for upscaled_video in os.listdir(upscaled_video_dir):
    upscaled_video_path = os.path.join(upscaled_video_dir, upscaled_video)
    print(upscaled_video_path)
    if upscaled_video not in pegasus_video_ids:
        video_id = upload_video_to_twelve_labs(index,upscaled_video_path)
        pegasus_video_ids[upscaled_video] = video_id
# Output

../videos/upscaled_videos/How To Make Birria Tacos [4nIFJFgH99w].mp4
Task created: id=67211b6b30c547875065b407 status=validating
  Status=pending
...
  Status=indexing
...
  Status=ready
The unique identifer of your video is 67211b6c1a5f2afde113615e.
../videos/upscaled_videos/How To Make a McDonald's Cheeseburger [SvOx7tA_Cv8].mp4
Task created: id=67211c446025850d8c890da5 status=validating
  Status=pending
...
  Status=indexing
...
  Status=ready
The unique identifer of your video is 67211c451a5f2afde113615f.
../videos/upscaled_videos/How To Make Potato Wedges [eZXbMWPJkKQ]

print(pegasus_video_ids)
# Output

{'How To Make Birria Tacos [4nIFJFgH99w].mp4': '67211b6c1a5f2afde113615e', 
"How To Make a McDonald's Cheeseburger [SvOx7tA_Cv8].mp4": '67211c451a5f2afde113615f', 
'How To Make Potato Wedges [eZXbMWPJkKQ]


Calling Pegasus to Chat with Full Video

First, we find the video_id associated with the video that matched our Chroma query.

video_id = pegasus_video_ids[found_full_video_name]
print(video_id)

Next, we ask Pegasus what the ingredients are for birria tacos. This is exactly the same query we used before, but this time we are chatting with the entire video instead of just a 6-second segment.

res = twelvelabs_client.generate.text(
  video_id=video_id,
  prompt=query
)
full_video_answer = res.data
print(f"query {query}")
print(f"{full_video_answer}")


Compare Full Video Answer to Segment Answer

We can then compare this to the answer given when we chatted with the 6-second segment.

print(f"segment answer: \n{segment_answer}")


13 - Use an Open Source Model to Chat with a Full Video

Now we will compare Pegasus to the open source model when chatting with a full video.

After we sample all of the full videos, we can run our model on the complete video that matched our query. The output is a little more interesting than before, but as we will see, it still falls short of answering the question.

# Sample all of the full videos:
sampled_database_video_info = process_videos_in_folder(upscaled_video_dir)

# Get the sampled frames for the full video that matched our Chroma query
# (process_videos_in_folder keys its results by filename without the extension)
found_full_video_key = os.path.splitext(found_full_video_name)[0]
full_video = sampled_database_video_info[found_full_video_key]['sampled_video']

# Each "content" entry is a list of dicts; you can mix image/video/text modalities
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": query},
            {"type": "video"},
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
prompt_len = len(prompt)

# Pass the sampled frames of the full video rather than the earlier 6-second segment
inputs = processor([prompt], videos=[full_video], padding=True, return_tensors="pt").to(model.device)
generate_kwargs = {"max_new_tokens": 100, "do_sample": True, "top_p": 0.9}

output = model.generate(**inputs, **generate_kwargs)
generated_text = processor.batch_decode(output, skip_special_tokens=True)

print(generated_text[0])


Compare Result to Pegasus Answer

As we can see, the open-source model still cannot give us a useful answer when chatting with the entire video, which highlights the strength of Pegasus:

print(f"Pegasus answer: \n{full_video_answer}")


14 - Comparison

Comparing the two models, Pegasus clearly understands our video and gives an accurate response to our query. It recognizes that this is a birria tacos tutorial and can produce the ingredient list, whether it is given the entire video or just a short segment.

The open-source LLaVA-NeXT-Video model understands our query but cannot extract the relevant information from the video. It recognizes that this is a cooking video, but it lacks the depth of understanding needed to answer our question.
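
To eyeball this comparison in one place, here is a small sketch (assuming the four answer variables from the earlier steps are still in scope) that prints the query and every answer side by side:

answers = {
    "Pegasus (segment)": segment_answer,
    "LLaVA-NeXT-Video (segment)": open_source_segment_generated_text[0],
    "Pegasus (full video)": full_video_answer,
    "LLaVA-NeXT-Video (full video)": generated_text[0],
}

print(f"Query: {query}\n")
for label, answer in answers.items():
    print(f"--- {label} ---\n{answer}\n")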


15 - Conclusion

This guide demonstrated how to interact with video content, either as individual segments or as a complete collection, using the TwelveLabs Embed API and Chroma’s vector database, with retrieval running locally on your machine.

We also compared TwelveLabs' Pegasus model with the LLaVA-NeXT-Video open-source model, evaluating the required infrastructure, developer experience, and query results. Compared to the open-source model, Pegasus showed promise with less operational overhead, better instruction following, and larger context lengths for easier querying of longer videos.


Appendix

For your reference and further exploration:

  1. Complete Colab Notebook

  2. TwelveLabs Documentation

  3. Chroma’s Client and Documentation
