Partnerships
Multimodal RAG: Chat with Videos Using TwelveLabs and Chroma


James Le
This guide demonstrates how to interact with video content, either individually or as a complete set. It uses the TwelveLabs Embed API and Chroma’s vector database, with retrieval running locally on your machine.


Mar 21, 2025
20 Min
Big thanks to Jeff Huber and Itai Smith from the Chroma team for reviewing the draft!
Welcome to this tutorial on integrating TwelveLabs' Embed API with Chroma’s Vector Database for RAG-based Q&A on videos. In this guide, you'll learn how to extract text answers from an unstructured database of videos using generative models.
We combine TwelveLabs' rich, contextual embeddings with Chroma’s vector database to store, index, and query these video embeddings, creating a chat application. This notebook demonstrates the current possibilities of these technologies with just a few lines of code.
For comparison, we also showcase the difference in developer experience between using TwelveLabs' Generate API to generate text responses and a leading open-source model, LLaVA-NeXT-Video.
1 - Overview
In this tutorial we will go through several examples where we use TwelveLabs Marengo to create video embeddings, use Chroma to store and query those embeddings to find relevant videos, and use TwelveLabs Pegasus and LLaVA-NeXT-Video to chat with the returned videos.
We will:
Create Video Embeddings Using the TwelveLabs Marengo Engine
Store Video Embeddings in a Chroma Database
Query Embeddings in our Chroma Database to Find Relevant Video Segments
Use TwelveLabs Pegasus to Chat with the Returned Video Segment
Use an Open Source Model to Chat with the Returned Video Segment
Compare Pegasus to the Open Source model
Use Chroma and TwelveLabs Embeddings to Search Multiple Videos
Use Pegasus to Chat with a Full Video
Use an Open Source Model to Chat with a Full Video
2 - Setup and Installation
Install Required Libraries
First, we will install the TwelveLabs and Chroma SDKs.
# Install Twelve Labs and Chroma libraries
!pip install --upgrade twelvelabs
!pip install --upgrade chromadb
Next, we will install the libraries to be used to run our open source model.
# Install libraries for use with the open source model
!pip install protobuf==3.20.3
!pip install --upgrade -q accelerate bitsandbytes
!pip install git+https://github.com/huggingface/transformers.git
!pip install av
If you are running this tutorial outside Colab, uncomment the cell below to install the libraries used to view the video data we are working with.
# Extra things to install if you're not on Colab
# !python -m pip install pillow
# !python -m pip install sentencepiece
# !python -m pip install matplotlib
3 - Preparing the Video Data
Next, we will prepare our video data.
Using our Video Data
This demo uses video data from a TwelveLabs Google Drive folder. To use it, you'll need to link the folder to your Google Drive, and then mount your Google Drive to this Colab.
Linking the folder to our Google Drive
Anyone can access the folder with this link: https://drive.google.com/drive/folders/1k6FmkVglFsdtJG4MTIK-2dk1Dk9gTPtu?usp=share_link
To link this to the correct spot in your Google Drive:
Go to "Shared with me" in Google Drive.
Locate the shared folder you want to access.
Select "Organize" -> "Add Shortcut"
Choose "My Drive" as the destination and click "Add".
Now this folder should be accessible at /content/drive/MyDrive/TwelveLabs-Chroma
Mounting Drive
Here we mount your drive to this Colab.
from google.colab import drive

drive.mount('/content/drive')
Set Video Path
Then we set the path for the videos we will be working with. This may change based on where you link the video folder.
video_folder_path = "/content/drive/MyDrive/TwelveLabs-Chroma/videos/"
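If you want a quick, optional sanity check that the shortcut and mount worked, you can confirm the folder exists and list its contents before going further:

# Optional sanity check: confirm the mounted folder is visible and list the videos in it
import os

print(os.path.exists(video_folder_path))
print(os.listdir(video_folder_path))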
Upscale Video Resolution
Some of our videos are too low resolution to use in the embedding engine, so we will double their resolution with upscale_video.
import numpy as np
import subprocess
import os

def upscale_video(input_file, output_path, target_width=854, target_height=480):
    """
    Upscale a video to the target width and height using FFmpeg.

    Args:
        input_file (str): Path to the input video file.
        output_path (str): Directory where the upscaled video will be saved.
        target_width (int): Desired output width. Default is 854.
        target_height (int): Desired output height. Default is 480.
    """
    output_file = os.path.join(output_path, os.path.basename(input_file))
    if os.path.exists(output_file):
        print(f"Skipping {input_file} as {output_file} already exists.")
        return

    # FFmpeg command to upscale the video
    ffmpeg_command = [
        'ffmpeg',
        '-i', input_file,                                # Input file
        '-vf', f'scale={target_width}:{target_height}',  # Scale filter with target dimensions
        '-c:a', 'copy',                                  # Copy audio stream without re-encoding
        '-y',                                            # Overwrite the output file if it exists
        output_file                                      # Output file
    ]

    # Run the FFmpeg command
    subprocess.run(ffmpeg_command)
    print(f"Upscaled video saved to {output_file}")
First, we create the path where we will save the upscaled videos.
upscaled_video_dir = video_folder_path + "upscaled_videos/"
Then, we will upscale the videos. This will skip the videos we have already upscaled and placed in our upscaled video folder.
# Upscale all .mp4 videos

# Create output directory if it doesn't exist
if not os.path.exists(upscaled_video_dir):
    os.makedirs(upscaled_video_dir)

# Iterate over all files in the raw video directory
for filename in os.listdir(video_folder_path):
    # Check if the file is a video file
    input_filepath = os.path.join(video_folder_path, filename)
    if filename.endswith(".mp4"):
        upscale_video(input_filepath, upscaled_video_dir)
# Output
...
Upscaled video saved to ../videos/upscaled_videos/How To Make Birria Tacos [4nIFJFgH99w].mp4
...
Upscaled video saved to ../videos/upscaled_videos/How To Make a McDonald's Cheeseburger [SvOx7tA_Cv8].mp4
...
Upscaled video saved to ../videos/upscaled_videos/How To Make Potato Wedges [eZXbMWPJkKQ]
4 - Create Video Embeddings Using the TwelveLabs Marengo Engine
Here we will use the TwelveLabs Marengo Engine to create embeddings for our video.
We will start by loading our TwelveLabs API key. This can be stored in your Colab secrets store which can be found by clicking the “Secrets” (Key Icon) on the left panel of your Colab and then clicking “+ Add new secret.”
from google.colab import userdata

TL_API_KEY = userdata.get('TL_API_KEY')
Next, we initialize the TwelveLabs and Chroma clients. The Chroma database in this tutorial is stored locally.
from twelvelabs import TwelveLabs
from twelvelabs.models.embed import EmbeddingsTask

# Initialize the Twelve Labs client
twelvelabs_client = TwelveLabs(api_key=TL_API_KEY)
Here we use the Ephemeral Client, which is great for experimenting. It runs in memory, so data is not persisted after the application closes.
import chromadb

# Initialize Ephemeral Chroma Client
chroma_client = chromadb.Client()
Alternate Clients
If you need persistent data, you have two options:
1 - Persistent Client: This will save and load your database to your local disk.
# Persistent Client Option
chroma_client = chromadb.PersistentClient(path="/path/to/save/to")
2 - HTTP Client: This will connect to a Chroma server running in a separate process, either self-deployed or on Chroma Cloud.
# HTTP Client
chroma_client = chromadb.HttpClient(host='localhost', port=8000)
Create Video Embeddings and Format for Chroma
Here we create video embeddings using Marengo and format them for Chroma. To upload data to Chroma you need three separate lists for the data you want to upload: embeddings, metadatas, and ids.
def on_task_update(task: EmbeddingsTask):
    print(f" Status={task.status}")

# Create video embeddings and format for Chroma
def create_video_embeddings(client, video_file, segment_length, task_id=None):
    # Upload the video to Twelve Labs if an embeddings task does not already exist
    video_name = os.path.basename(video_file)
    if task_id is None or task_id == "":
        task = client.embed.task.create(
            engine_name="Marengo-retrieval-2.7",
            video_file=video_file,
            video_clip_length=segment_length
        )
        print(
            f"Created task: id={task.id} engine_name={task.engine_name} status={task.status}"
        )
        status = task.wait_for_done(
            sleep_interval=2,
            callback=on_task_update
        )
        print(f"Embedding done: {status}")
        task_id = task.id

    # Fetch embeddings
    task = client.embed.task.retrieve(task_id)
    print("task", task)

    # Format for Chroma
    embeddings = []
    metadatas = []
    ids = []
    idx = 0
    print("embeddings", task.video_embeddings)
    if task.video_embeddings is not None:
        for v in task.video_embeddings:
            metadata = {
                "embedding_scope": v.embedding_scope,
                "start_offset_sec": v.start_offset_sec,
                "end_offset_sec": v.end_offset_sec,
                "video_file": video_file,
                "video_name": video_name,
                "task_id": task.id,
                "video_segment_number": idx
            }
            embedding = v.values
            id = task.id + "_" + str(idx)
            metadatas.append(metadata)
            embeddings.append(embedding)
            ids.append(id)
            idx += 1

    return (ids, metadatas, embeddings, task_id)
Next, we select the video we will be working with and set a 6 second embedding segment duration - each embedding will refer to a 6 second segment in the video.
# Set the segment duration and the video we will be working with
segment_duration = 6
current_video_path = upscaled_video_dir + "How To Make Birria Tacos [4nIFJFgH99w].mp4"
Then, we retrieve our embeddings data to be uploaded to Chroma. We also get our embeddings task_id from TwelveLabs. This task_id can be used to retrieve embeddings for this video in the future so we don’t have to run the embedding model multiple times for the same video.
# Get embeddings to upload to Chroma
# Set task_id if you already have one, otherwise set to empty string
task_id = ""

ids, metadatas, embeddings, task_id = create_video_embeddings(twelvelabs_client, current_video_path, segment_duration, task_id)
You can find the task_id in the output.
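If you save that task_id, a later run can skip the embedding step entirely: pass it back into create_video_embeddings, which retrieves the stored embeddings from TwelveLabs instead of creating a new task. A minimal sketch (the id string below is a placeholder, not a real task id):

# Reuse a previously created embeddings task instead of re-embedding the video
saved_task_id = "your-task-id-from-a-previous-run"  # placeholder, not a real id
ids, metadatas, embeddings, task_id = create_video_embeddings(
    twelvelabs_client, current_video_path, segment_duration, saved_task_id
)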
5 - Store Video Embeddings in a Chroma Database
Now that we have our records and vectors in an easy format, we can simply add them to a new collection in Chroma (or an existing collection if one already exists).
# Fetch or create a Chroma Collection
chroma_collection_name = "video_embeddings"
collection = chroma_client.get_or_create_collection(chroma_collection_name)

# Add embeddings and metadata to our collection
collection.add(
    metadatas=metadatas,
    embeddings=embeddings,
    ids=ids
)
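As an optional check, you can confirm how many segment embeddings were stored; with 6 second segments, the count should roughly match the video length divided by six.

# Optional: verify how many segment embeddings are now in the collection
print(f"Stored {collection.count()} segment embeddings in '{chroma_collection_name}'")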
6 - Query Embeddings in our Chroma Database to Find Relevant Video Segments
Testing the Vector Search
Now that we have everything in the collection, we can test and see that the embeddings query works. We will search by the first returned embedding, which should have a distance of zero to itself.
# Use first embedding as a test search
test_segment_embeddings = embeddings[0]

results = collection.query(
    query_embeddings=[test_segment_embeddings],
    n_results=4
)

print("search embeddings for:", ids[0])
print("found:", results["ids"][0][0])
print("distance:", results["distances"][0][0])

# Assert that the first segment's embedding is distance 0 from itself
assert results["ids"][0][0] == ids[0]
assert results["distances"][0][0] == 0
Querying our Vector Database
Now we will show how TwelveLabs embeddings and the Chroma vector database can efficiently query a large number of video segments.
The video we chose earlier is a cooking tutorial for birria tacos. We will use TwelveLabs and Chroma to figure out what ingredients we need to create birria tacos.
We set that query here.
query = "What are the ingredients for birria tacos?"
Next, we create a function that uses TwelveLabs Marengo to embed a text query and then queries our Chroma collection using that embedding.
import os

def query_chroma(collection, query, n_results=1):
    # Create an embedding for the query text
    embedding = twelvelabs_client.embed.create(
        engine_name="Marengo-retrieval-2.7",
        text=query,
        text_truncate="start",
    )
    query_embeddings = embedding.text_embedding.float

    # Search the Chroma database with the query embedding
    response = collection.query(
        query_embeddings=query_embeddings,
        n_results=n_results,
    )
    return response
Then, we run our Chroma query to find the 6 second segment of our tutorial that best answers this query. Later in this tutorial we will pass this segment into TwelveLabs Pegasus and an open source model to ask it questions including which ingredients we need.
Our metadata has information on the segment and the video it came from.
response = query_chroma(collection, query)

# Print the properties and distance of the most similar object
print(response["ids"][0][0])
print(response["distances"][0][0])
print(response["metadatas"][0][0])

# Keep the metadata for the found video segment for the next step
found_video_metadata = response["metadatas"][0][0]
# Output
672101d56025850d8c890d1c_6
1.3052971363067627
{'embedding_scope': 'clip', 'end_offset_sec': 42.0, 'start_offset_sec': 36.0, 'task_id': '672101d56025850d8c890d1c', 'video_file': '../videos/upscaled_videos/How To Make Birria Tacos [4nIFJFgH99w].mp4', 'video_name': 'How To Make Birria Tacos [4nIFJFgH99w]
7 - Splitting Videos into Segments
While we could upload the entire tutorial video to Pegasus, we will only upload the relevant 6 second segment we found in the previous step. This allows us to be efficient with compute.
Here we split the video we are working with into segments that match our embeddings segments and put them into the split videos folder.
split_video_dir = video_folder_path + "split_videos/"

def split_video(input_path, output_dir, segment_duration=6):
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    filename = os.path.splitext(os.path.basename(input_path))[0]
    filetype = os.path.splitext(os.path.basename(input_path))[1]

    # Split video into segments
    ffmpeg_command = [
        'ffmpeg',
        '-i', input_path,                           # Input video file
        '-c', 'copy',                               # Copy both video and audio codecs
        '-f', 'segment',                            # Segment mode
        '-segment_time', str(segment_duration),     # Segment length
        '-reset_timestamps', '1',                   # Reset timestamps for each segment
        output_dir + filename + '_%03d' + filetype  # Output filename pattern (e.g., output_001.mp4)
    ]

    # Run the command
    subprocess.run(ffmpeg_command)
    print("Video split into 6-second segments successfully.")

# Split the video into segments
split_video(input_path=current_video_path, output_dir=split_video_dir, segment_duration=segment_duration)
# Output
...
[segment @ 0x561c0a7b4800] Opening '../videos/split_videos/How To Make Birria Tacos [4nIFJFgH99w]_001.mp4' for writing
...
[segment @ 0x561c0a7b4800] Opening '../videos/split_videos/How To Make Birria Tacos [4nIFJFgH99w]
8 - Use TwelveLabs Pegasus to Chat with the Returned Video Segment
These next few cells show how simple it is to use Pegasus to chat with a video; everything comes ready out of the box.
Uploading Video Segment to Pegasus
First, we will create an index that uses the Pegasus engine for our video uploads, then upload the segment to it.
# Create or retrieve the Pegasus index
engines = [
    {
        "name": "pegasus1.2",
        "options": ["visual", "conversation"]
    }
]

index_name = "cooking_video_index"
indices_list = twelvelabs_client.index.list(name=index_name)

if len(indices_list) == 0:
    index = twelvelabs_client.index.create(
        name=index_name,
        engines=engines,
    )
    print(f"A new index has been created: id={index.id} name={index.name} engines={index.engines}")
else:
    index = indices_list[0]
    print(f"Index already exists: id={index.id} name={index.name} engines={index.engines}")
# Output
A new index has been created: id=672120b06025850d8c890db2 name=cooking_video_index engines=root=[Engine(name='pegasus1.2', options=['visual', 'conversation']
Get Video Segment File Name
Next, we will use the metadata from our Chroma query to find the segment that matched our query.
# Get the video segment filename
found_video_segment_number = int(found_video_metadata["video_segment_number"])
found_video_file = found_video_metadata["video_file"]
found_video_filename = os.path.splitext(os.path.basename(found_video_file))[0]
found_video_filetype = os.path.splitext(os.path.basename(found_video_file))[1]

found_video_segment_filename = found_video_filename + f"_{found_video_segment_number:03d}"
found_video_segment_path = split_video_dir + found_video_segment_filename + found_video_filetype

print(found_video_segment_path)
# Output
../videos/split_videos/How To Make Birria Tacos [4nIFJFgH99w]
Upload Video to Pegasus and Get Video ID
We then create a function to upload our video segment to TwelveLabs Pegasus.
This function returns a video_id that we will use to chat with the video. We can store this video_id for multi-turn chat.
def upload_video_to_twelve_labs(index, video_path):
    # Upload our video to our Twelve Labs index
    task = twelvelabs_client.task.create(
        index_id=index.id,
        file=video_path
    )
    print(f"Task created: id={task.id} status={task.status}")

    task.wait_for_done(sleep_interval=5, callback=on_task_update)
    if task.status != "ready":
        raise RuntimeError(f"Indexing failed with status {task.status}")
    print(f"The unique identifier of your video is {task.video_id}.")

    # Return the video id
    return task.video_id
If we already have a video id for this video segment, we can store it here.
# Set video_id if you already have one, otherwise set to empty string
video_id = ""

# Upload video to get a video id to chat with in Pegasus
if video_id == "":
    video_id = upload_video_to_twelve_labs(index, found_video_segment_path)
You can find the video_id in the output.
Calling Pegasus
Here we query the video segment with the query we used to find it. Because TwelveLabs handles all of the boilerplate behind the scenes, we can call our model with a simple function.
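A minimal sketch of that call, mirroring the generate.text call we use later for the full video, stores the response as segment_answer so we can print it again in the comparison section:

# Ask Pegasus about the 6 second segment (sketch mirroring the full-video call later in this guide)
res = twelvelabs_client.generate.text(
    video_id=video_id,  # video_id returned when we uploaded the segment
    prompt=query
)
segment_answer = res.data

print(f"query {query}")
print(f"{segment_answer}")

Because the video_id persists in the index, you can keep calling generate.text against the same segment with follow-up prompts for multi-turn style chat.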
9 - Use an Open Source Model to Chat with the Returned Video Segment
Now we will use an open source model to chat with our video segment and compare it to Pegasus.
First, we need to sample the videos ourselves for the model to consume. We'll modify the LLaVa-NeXT-Video Sampling code to get a uniform sample of 8 frames for each video.
We can do this for all of the video segments in our folder. read_video_pyav comes directly from the LLaVA-NeXT-Video Colab notebook and formats videos in the correct numpy representation for inference.
import av

def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.

    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.

    Returns:
        np.ndarray: np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

def sample_video(video_path, num_samples=8):
    container = av.open(video_path)
    # Sample num_samples frames uniformly from the video
    total_frames = container.streams.video[0].frames
    indices = np.arange(0, total_frames, total_frames / num_samples).astype(int)
    sampled_frames = read_video_pyav(container, indices)
    return sampled_frames

def process_videos_in_folder(folder_path):
    sample_info = {}
    # Supported video file extensions
    video_extensions = ('.mp4', '.avi', '.mov', '.mkv')
    for filename in os.listdir(folder_path):
        simple_video_name = os.path.splitext(os.path.basename(filename))[0]
        if filename.lower().endswith(video_extensions):
            video_path = os.path.join(folder_path, filename)
            try:
                sampled_clip = sample_video(video_path)
                sample_info[simple_video_name] = {"sampled_video": sampled_clip, "video_path": video_path}
            except Exception as e:
                print(f"Error processing {filename}: {str(e)}")
    return sample_info

sampled_video_info = process_videos_in_folder(split_video_dir)

# Get the video segment found in our Chroma query
video_segment = sampled_video_info[found_video_segment_filename]['sampled_video']
Setting up our Model
We'll set up our model with 4-bit quantization to speed up inference.
from transformers import BitsAndBytesConfig, LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    "llava-hf/LLaVA-NeXT-Video-7B-hf",
    quantization_config=quantization_config,
    device_map='auto'
)

# To use later to play the videos in the notebook itself
from matplotlib import pyplot as plt
from matplotlib import animation
from IPython.display import HTML

# np array with shape (frames, height, width, channels)
# pick one at random just to see
video = sampled_video_info[list(sampled_video_info.keys())[0]]['sampled_video']

fig = plt.figure()
im = plt.imshow(video[0, :, :, :])
plt.close()  # this is required to not display the generated image

def init():
    im.set_data(video[0, :, :, :])

def animate(i):
    im.set_data(video[i, :, :, :])
    return im

anim = animation.FuncAnimation(fig, animate, init_func=init, frames=video.shape[0], interval=100)
HTML(anim.to_html5_video())
Running the Model
Now that we have our query and the relevant video, we can feed them into the model to get an output.
# Each "content" is a list of dicts and you can add image/video/text modalities conversation = [ { "role": "user", "content": [ {"type": "text", "text": query}, {"type": "video"}, ], }, ] prompt = processor.apply_chat_template(conversation, add_generation_prompt=True) prompt_len = len(prompt) inputs = processor([prompt], videos=[video_segment], padding=True, return_tensors="pt").to(model.device) generate_kwargs = {"max_new_tokens": 100, "do_sample": True, "top_p": 0.9} output = model.generate(**inputs, **generate_kwargs) open_source_segment_generated_text = processor.batch_decode(output, skip_special_tokens=True) print(open_source_segment_generated_text[0])
10 - Compare Pegasus to the Open Source model
As we can see, Pegasus does a better job of answering our query (printed out again for convenience).
print(f"query {query}") print("pegasus answer") print(segment_answer) print("open source answer") print(open_source_segment_generated_text[0])
11 - Using Chroma and TwelveLabs Embeddings to Search Multiple Videos
We have already shown how TwelveLabs Marengo and Chroma can be used to find relevant segments within a video.
Now we can show the true power of Marengo and Chroma for a RAG use case — by querying a large set of potential videos to find the specific video that we need to answer the query.
Embedding our Video Database
First we will embed all of our videos and store those embeddings in Chroma:
# Embed and store task ids for all videos
chroma_collection_name = "video_embeddings"
collection = chroma_client.get_or_create_collection(chroma_collection_name)
Then, we create a dictionary to store task_ids for each video in case we need to retrieve these embeddings from TwelveLabs in the future.
# Store Twelve Labs task ids for each video
task_ids = {}
Next, we get the embeddings for each video using the create_video_embeddings function we created earlier and upload them to our Chroma collection.
# Get embeddings and metadata for each video
# Store task ids so we don't upload videos multiple times
for filename in os.listdir(upscaled_video_dir):
    if filename.endswith(".mp4"):
        if filename in task_ids.keys():
            task_id = task_ids[filename]
        else:
            task_id = None

        file_path = os.path.join(upscaled_video_dir, filename)
        ids, metadatas, embeddings, task_id = create_video_embeddings(twelvelabs_client, file_path, segment_duration, task_id)
        task_ids[filename] = task_id

        collection.add(
            metadatas=metadatas,
            embeddings=embeddings,
            ids=ids
        )
print(task_ids)
# Output
{'How To Make Birria Tacos [4nIFJFgH99w].mp4': '67211a5430c547875065b3f7', "How To Make a McDonald's Cheeseburger [SvOx7tA_Cv8].mp4": '67211ab56025850d8c890d93', 'How To Make Potato Wedges [eZXbMWPJkKQ]
Querying our Database
Here we use the same query as before to find the full video that will answer our query.
response = query_chroma(collection, query)

found_full_video_name = response["metadatas"][0][0]["video_name"]
print(found_full_video_name)
# Output
How To Make Birria Tacos [4nIFJFgH99w]
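To see which videos the closest segments come from, you can also ask for more than one result; this short sketch uses the n_results parameter of the query_chroma helper we defined earlier and prints the source video and distance for each match.

# Optional: inspect the top matching segments and the videos they come from
response = query_chroma(collection, query, n_results=3)
for segment_metadata, distance in zip(response["metadatas"][0], response["distances"][0]):
    print(segment_metadata["video_name"], distance)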
12 - Use Pegasus to Chat with a Full Video
We already have an index created, so we just need to upload the videos to this index and then call Pegasus.
# Store Pegasus video ids so that we don't upload videos multiple times
pegasus_video_ids = {}
We upload each video here and store the video ids we will use to chat with the video.
for upscaled_video in os.listdir(upscaled_video_dir):
    upscaled_video_path = os.path.join(upscaled_video_dir, upscaled_video)
    print(upscaled_video_path)
    if upscaled_video not in pegasus_video_ids:
        video_id = upload_video_to_twelve_labs(index, upscaled_video_path)
        pegasus_video_ids[upscaled_video] = video_id
# Output
../videos/upscaled_videos/How To Make Birria Tacos [4nIFJFgH99w].mp4
Task created: id=67211b6b30c547875065b407 status=validating
 Status=pending
...
 Status=indexing
...
 Status=ready
The unique identifier of your video is 67211b6c1a5f2afde113615e.
../videos/upscaled_videos/How To Make a McDonald's Cheeseburger [SvOx7tA_Cv8].mp4
Task created: id=67211c446025850d8c890da5 status=validating
 Status=pending
...
 Status=indexing
...
 Status=ready
The unique identifier of your video is 67211c451a5f2afde113615f.
../videos/upscaled_videos/How To Make Potato Wedges [eZXbMWPJkKQ]
print(pegasus_video_ids)
# Output
{'How To Make Birria Tacos [4nIFJFgH99w].mp4': '67211b6c1a5f2afde113615e', "How To Make a McDonald's Cheeseburger [SvOx7tA_Cv8].mp4": '67211c451a5f2afde113615f', 'How To Make Potato Wedges [eZXbMWPJkKQ]
Calling Pegasus to Chat with Full Video
First, we find the video_id associated with the video that matched our Chroma query.
video_id = pegasus_video_ids[found_full_video_name]
print(video_id)
Next, we ask Pegasus what the ingredients are for birria tacos. It is exactly the same as the query we used before, but this time we are chatting with the entire video instead of just a 6 second segment.
res = twelvelabs_client.generate.text(
    video_id=video_id,
    prompt=query
)
full_video_answer = res.data

print(f"query {query}")
print(f"{full_video_answer}")
Compare full video answer to segment answer
We can then compare this to the answer given when we chatted with the 6 second segment.
print(f"segment answer: \n{segment_answer}")
13 - Use an Open Source Model to Chat with a Full Video
Now we will compare Pegasus to the open source model when chatting with a full video.
After we sample all of the full videos, we can run our model on the full video found by our Chroma query, which outputs some more interesting answers.
# Sample all of the videos:
sampled_database_video_info = process_videos_in_folder(upscaled_video_dir)

# Look up the sampled frames for the full video found by our Chroma query
# (the sample dictionary is keyed by the filename without its extension)
found_full_video_key = os.path.splitext(found_full_video_name)[0]
full_video_sample = sampled_database_video_info[found_full_video_key]['sampled_video']

# Each "content" is a list of dicts and you can add image/video/text modalities
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": query},
            {"type": "video"},
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
prompt_len = len(prompt)

inputs = processor([prompt], videos=[full_video_sample], padding=True, return_tensors="pt").to(model.device)

generate_kwargs = {"max_new_tokens": 100, "do_sample": True, "top_p": 0.9}
output = model.generate(**inputs, **generate_kwargs)

generated_text = processor.batch_decode(output, skip_special_tokens=True)
print(generated_text[0])
Compare Result to Pegasus Answer
As we can see, the open source model cannot give us an answer when chatting with the entire video either, showing the strength of Pegasus:
print(f"Pegasus answer: \n{full_video_answer}")
14 - Comparison
Comparing the two models, Pegasus clearly understands our video and gives an accurate response to our query. It recognizes that this is a birria tacos tutorial and can provide the ingredient list, both for the full video and for individual segments.
The LLaVA-NeXT-Video open-source model understands our query but is not able to extract the relevant information from the video. It does recognize that this is a cooking video, yet it lacks the richness of understanding needed to answer our question.
15 - Conclusion
This guide demonstrated how to interact with video content, either individually or as a complete set. It used the TwelveLabs Embed API and Chroma’s vector database, with retrieval running locally on your machine.
We also compared TwelveLabs' Pegasus model with the LLaVA-NeXT-Video open-source model, evaluating the required infrastructure, developer experience, and query results. Compared to the open-source model, Pegasus showed promise with less operational overhead, better instruction following, and larger context lengths for easier querying of longer videos.
Appendix
For your reference and further exploration:
Complete Colab Notebook
Chroma’s Client and Documentation
Go to "Shared with me" in Google Drive.
Locate the shared folder you want to access.
Select "Organize" -> "Add Shortcut"
Choose "My Drive" as the destination and click "Add".
Now this folder should be accessible at /content/drive/MyDrive/TwelveLabs-Chroma
Mounting Drive
Here we mount your drive to this Colab.
from google.colab import drive

drive.mount('/content/drive')
Set Video Path
Then we set the path for the videos we will be working with. This may change based on where you link the video folder.
video_folder_path = "/content/drive/MyDrive/TwelveLabs-Chroma/videos/"
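If the shortcut and mount worked, you should be able to list the raw videos from the notebook. A quick, optional sanity check (assuming the path above):

import os

# List the raw .mp4 files we expect to find in the shared folder
print([f for f in os.listdir(video_folder_path) if f.endswith(".mp4")])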
Upscale Video Resolution
Some of our videos are too low resolution for the embedding engine, so we will upscale them with upscale_video (to 854x480 by default).
import numpy as np
import subprocess
import os

def upscale_video(input_file, output_path, target_width=854, target_height=480):
    """
    Upscale a video to the target width and height using FFmpeg.

    Args:
        input_file (str): Path to the input video file.
        output_path (str): Directory to save the upscaled video.
        target_width (int): Desired output width. Default is 854.
        target_height (int): Desired output height. Default is 480.
    """
    output_file = os.path.join(output_path, os.path.basename(input_file))
    if os.path.exists(output_file):
        print(f"Skipping {input_file} as {output_file} already exists.")
        return

    # FFmpeg command to upscale the video
    ffmpeg_command = [
        'ffmpeg',
        '-y',                                            # Overwrite the output without prompting
        '-i', input_file,                                # Input file
        '-vf', f'scale={target_width}:{target_height}',  # Scale filter with target dimensions
        '-c:a', 'copy',                                  # Copy audio stream without re-encoding
        output_file,                                     # Output file
    ]

    # Run the FFmpeg command
    subprocess.run(ffmpeg_command)
    print(f"Upscaled video saved to {output_file}")
First, we create the path where we will save the upscaled videos.
upscaled_video_dir = video_folder_path + "upscaled_videos/"
Then, we will upscale the videos. This will skip the videos we have already upscaled and placed in our upscaled video folder.
# Upscale all .mp4 videos

# Create output directory if it doesn't exist
if not os.path.exists(upscaled_video_dir):
    os.makedirs(upscaled_video_dir)

# Iterate over all files in the raw video directory
for filename in os.listdir(video_folder_path):
    # Check if the file is a video file
    input_filepath = os.path.join(video_folder_path, filename)
    if filename.endswith(".mp4"):
        upscale_video(input_filepath, upscaled_video_dir)
# Output ... Upscaled video saved to ../videos/upscaled_videos/How To Make Birria Tacos [4nIFJFgH99w].mp4 ... Upscaled video saved to ../videos/upscaled_videos/How To Make a McDonald's Cheeseburger [SvOx7tA_Cv8].mp4 ... Upscaled video saved to ../videos/upscaled_videos/How To Make Potato Wedges [eZXbMWPJkKQ]
4 - Create Video Embeddings Using the TwelveLabs Marengo Engine
Here we will use the TwelveLabs Marengo Engine to create embeddings for our video.
We will start by loading our TwelveLabs API key. This can be stored in your Colab secrets store, which you can find by clicking “Secrets” (the key icon) in the left panel of your Colab and then clicking “+ Add new secret.”
from google.colab import userdata

TL_API_KEY = userdata.get('TL_API_KEY')
Next, we initialize the TwelveLabs and Chroma clients. The Chroma database in this tutorial is stored locally.
from twelvelabs import TwelveLabs
from twelvelabs.models.embed import EmbeddingsTask

# Initialize the Twelve Labs client
twelvelabs_client = TwelveLabs(api_key=TL_API_KEY)
Here we use the Ephemeral Client, which is great for experimenting. It runs in memory, so data is not persisted after the application closes.
import chromadb

# Initialize an ephemeral (in-memory) Chroma client
chroma_client = chromadb.Client()
Alternate Clients
If you need persistent data, you have two options:
1 - Persistent Client: This will save and load your database to your local disk.
# Persistent Client Option
chroma_client = chromadb.PersistentClient(path="/path/to/save/to")
2 - HTTP Client: This will connect to a Chroma server running in a separate process, either self-hosted or on Chroma Cloud.
# HTTP Client Option
chroma_client = chromadb.HttpClient(host='localhost', port=8000)
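Whichever client you choose, a quick way to confirm it is wired up is Chroma's heartbeat call, which returns a timestamp if the client can reach its backend:

# Optional sanity check that the client is reachable
chroma_client.heartbeat()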
Create Video Embeddings and Format for Chroma
Here we create video embeddings using Marengo and format them for Chroma. To upload data to Chroma, you need three parallel lists covering everything you want to upload: embeddings, metadatas, and ids.
def on_task_update(task: EmbeddingsTask):
    print(f" Status={task.status}")

# Create video embeddings and format for Chroma
def create_video_embeddings(client, video_file, segment_length, task_id=None):
    # Upload the video to Twelve Labs if it has not been embedded already
    video_name = os.path.basename(video_file)
    if task_id is None or task_id == "":
        task = client.embed.task.create(
            engine_name="Marengo-retrieval-2.7",
            video_file=video_file,
            video_clip_length=segment_length
        )
        print(
            f"Created task: id={task.id} engine_name={task.engine_name} status={task.status}"
        )
        status = task.wait_for_done(
            sleep_interval=2,
            callback=on_task_update
        )
        print(f"Embedding done: {status}")
        task_id = task.id

    # Fetch the embeddings for this task
    task = client.embed.task.retrieve(task_id)
    print("task", task)

    # Format for Chroma: parallel lists of ids, metadatas, and embeddings
    embeddings = []
    metadatas = []
    ids = []
    idx = 0
    print("embeddings", task.video_embeddings)
    if task.video_embeddings is not None:
        for v in task.video_embeddings:
            metadata = {
                "embedding_scope": v.embedding_scope,
                "start_offset_sec": v.start_offset_sec,
                "end_offset_sec": v.end_offset_sec,
                "video_file": video_file,
                "video_name": video_name,
                "task_id": task.id,
                "video_segment_number": idx
            }
            embedding = v.values
            id = task.id + "_" + str(idx)
            metadatas.append(metadata)
            embeddings.append(embedding)
            ids.append(id)
            idx += 1

    return (ids, metadatas, embeddings, task_id)
Next, we select the video we will be working with and set a 6-second embedding segment duration: each embedding will refer to a 6-second segment of the video.
# Set the segment duration and the video we will be working with
segment_duration = 6
current_video_path = upscaled_video_dir + "How To Make Birria Tacos [4nIFJFgH99w].mp4"
Then, we retrieve the embeddings data to be uploaded to Chroma. We also get our embeddings task_id from TwelveLabs. This task_id can be used to retrieve embeddings for this video in the future, so we don't have to run the embedding model multiple times for the same video.
# Get embeddings to upload to Chroma
# Set task_id if you already have one, otherwise set it to an empty string
task_id = ""

ids, metadatas, embeddings, task_id = create_video_embeddings(
    twelvelabs_client, current_video_path, segment_duration, task_id
)
You can find the task_id in the output.
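If you hold on to that task_id, a later run can skip the embedding step entirely: create_video_embeddings will retrieve the stored embeddings from TwelveLabs instead of re-creating them. A minimal sketch (the ID below is a placeholder for your own value):

# Reuse embeddings from a previous run instead of re-embedding the video
saved_task_id = "<task_id printed above>"
ids, metadatas, embeddings, task_id = create_video_embeddings(
    twelvelabs_client, current_video_path, segment_duration, saved_task_id
)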
5 - Store Video Embeddings in a Chroma Database
Now that we have our records and vectors in the right format, we can simply add them to a collection in Chroma (creating it if it does not already exist).
# Fetch or create a Chroma collection
chroma_collection_name = "video_embeddings"
collection = chroma_client.get_or_create_collection(chroma_collection_name)

# Add embeddings and metadata to our collection
collection.add(
    metadatas=metadatas,
    embeddings=embeddings,
    ids=ids
)
6 - Query Embeddings in our Chroma Database to Find Relevant Video Segments
Testing the Vector Search
Now that we have everything in the collection, we can test that the embeddings query works. We will query with the first embedding we stored, which should have a distance of zero from itself.
# Use the first embedding as a test search
test_segment_embeddings = embeddings[0]

results = collection.query(
    query_embeddings=[test_segment_embeddings],
    n_results=4
)

print("search embeddings for:", ids[0])
print("found:", results["ids"][0][0])
print("distance:", results["distances"][0][0])

# Assert that the first segment's embedding is distance 0 from itself
assert results["ids"][0][0] == ids[0]
assert results["distances"][0][0] == 0
Querying our Vector Database
Now we will show how TwelveLabs embeddings and the Chroma vector database let us efficiently query a large amount of video content.
The video we chose earlier is a cooking tutorial for birria tacos. We will use TwelveLabs and Chroma to figure out what ingredients we need to create birria tacos.
We set that query here.
query = "What are the ingredients for birria tacos?"
Next, we create a function that uses TwelveLabs Marengo to embed a text query and then queries our Chroma collection with that embedding.
import os

def query_chroma(collection, query, n_results=1):
    # Create an embedding for the text query
    embedding = twelvelabs_client.embed.create(
        engine_name="Marengo-retrieval-2.7",
        text=query,
        text_truncate="start",
    )
    query_embeddings = embedding.text_embedding.float

    # Search the Chroma database with the query embedding
    response = collection.query(
        query_embeddings=query_embeddings,
        n_results=n_results,
    )
    return response
Then, we run our Chroma query to find the 6-second segment of the tutorial that best answers this query. Later in this tutorial, we will pass this segment to TwelveLabs Pegasus and an open source model and ask questions, including which ingredients we need.
Our metadata has information on the segment and the video it came from.
response = query_chroma(collection, query)

# Print the properties and distance of the most similar object
print(response["ids"][0][0])
print(response["distances"][0][0])
print(response["metadatas"][0][0])

# Keep the metadata for the found video segment for the next step
found_video_metadata = response["metadatas"][0][0]
# Output 672101d56025850d8c890d1c_6 1.3052971363067627 {'embedding_scope': 'clip', 'end_offset_sec': 42.0, 'start_offset_sec': 36.0, 'task_id': '672101d56025850d8c890d1c', 'video_file': '../videos/upscaled_videos/How To Make Birria Tacos [4nIFJFgH99w].mp4', 'video_name': 'How To Make Birria Tacos [4nIFJFgH99w]
7 - Splitting Videos into Segments
While we could upload the entire tutorial video to Pegasus, we will only upload the relevant 6-second segment we found in the previous step. This keeps our compute usage low.
Here we split the video we are working with into segments that match our embedding segments and put them into the split videos folder.
split_video_dir = video_folder_path + "split_videos/"

def split_video(input_path, output_dir, segment_duration=6):
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    filename = os.path.splitext(os.path.basename(input_path))[0]
    filetype = os.path.splitext(os.path.basename(input_path))[1]

    # Split video into segments
    ffmpeg_command = [
        'ffmpeg',
        '-i', input_path,                            # Input video file
        '-c', 'copy',                                # Copy both video and audio codecs
        '-f', 'segment',                             # Segment mode
        '-segment_time', str(segment_duration),      # Segment length
        '-reset_timestamps', '1',                    # Reset timestamps for each segment
        output_dir + filename + '_%03d' + filetype   # Output filename pattern (e.g., output_001.mp4)
    ]

    # Run the command
    subprocess.run(ffmpeg_command)
    print("Video split into 6-second segments successfully.")

# Split the video into segments
split_video(input_path=current_video_path, output_dir=split_video_dir, segment_duration=segment_duration)
# Output ... [segment @ 0x561c0a7b4800] Opening '../videos/split_videos/How To Make Birria Tacos [4nIFJFgH99w]_001.mp4' for writing ... [segment @ 0x561c0a7b4800] Opening '../videos/split_videos/How To Make Birria Tacos [4nIFJFgH99w]
8 - Use TwelveLabs Pegasus to Chat with the Returned Video Segment
These next few cells show how simple it is to use Pegasus to chat with a video: everything comes ready out of the box.
Uploading Video Segment to Pegasus
First, we will create an index that uses the Pegasus engine for our video uploads, and then upload our segment to it.
# Create or retrieve the Pegasus index
engines = [
    {
        "name": "pegasus1.2",
        "options": ["visual", "conversation"]
    }
]

index_name = "cooking_video_index"
indices_list = twelvelabs_client.index.list(name=index_name)

if len(indices_list) == 0:
    index = twelvelabs_client.index.create(
        name=index_name,
        engines=engines,
    )
    print(f"A new index has been created: id={index.id} name={index.name} engines={index.engines}")
else:
    index = indices_list[0]
    print(f"Index already exists: id={index.id} name={index.name} engines={index.engines}")
# Output
A new index has been created: id=672120b06025850d8c890db2
name=cooking_video_index
engines=root=[Engine(name='pegasus1.2', options=['visual', 'conversation']
Get Video Segment File Name
Next, we will use the metadata from our Chroma query to find the segment that matched our query.
# Get the video segment filename
found_video_segment_number = int(found_video_metadata["video_segment_number"])
found_video_file = found_video_metadata["video_file"]
found_video_filename = os.path.splitext(os.path.basename(found_video_file))[0]
found_video_filetype = os.path.splitext(os.path.basename(found_video_file))[1]

found_video_segment_filename = found_video_filename + f"_{found_video_segment_number:03d}"
found_video_segment_path = split_video_dir + found_video_segment_filename + found_video_filetype
print(found_video_segment_path)
# Output
../videos/split_videos/How To Make Birria Tacos [4nIFJFgH99w]
Upload Video to Pegasus and Get Video ID
We then create a function to upload our video segment to TwelveLabs Pegasus.
This function returns a video_id that we will use to chat with the video. We can store this video_id for multi-turn chat.
def upload_video_to_twelve_labs(index, video_path):
    # Upload our video to our Twelve Labs index
    task = twelvelabs_client.task.create(
        index_id=index.id,
        file=video_path
    )
    print(f"Task created: id={task.id} status={task.status}")

    task.wait_for_done(sleep_interval=5, callback=on_task_update)
    if task.status != "ready":
        raise RuntimeError(f"Indexing failed with status {task.status}")
    print(f"The unique identifier of your video is {task.video_id}.")

    # Return the video id
    return task.video_id
If we already have a video id for this video segment, we can store it here.
# Set video_id if you already have one, otherwise set it to an empty string
video_id = ""

# Upload the video to get a video id to chat with in Pegasus
if video_id == "":
    video_id = upload_video_to_twelve_labs(index, found_video_segment_path)
You can find the video_id in the output.
Calling Pegasus
Here we query the video segment with the query we used to find it. Because TwelveLabs handles all of the boilerplate behind the scenes, we can call our model with a simple function.
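The cell below is a minimal sketch of that call, mirroring the full-video call later in this tutorial; it stores the response as segment_answer, which we reuse in the comparisons below.

# Ask Pegasus about the retrieved 6-second segment
res = twelvelabs_client.generate.text(
    video_id=video_id,  # the segment's video_id from the upload above
    prompt=query
)
segment_answer = res.data

print(f"query {query}")
print(f"{segment_answer}")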
9 - Use an Open Source Model to Chat with the Returned Video Segment
Now we will use an open source model to chat with our video segment and compare it to Pegasus.
First, we need to sample the videos ourselves for the model to consume. We'll modify the LLaVa-NeXT-Video Sampling code to get a uniform sample of 8 frames for each video.
And we can do this for all of the video segments in our folder.
read_video_pyav comes directly from the LLaVa-NeXT-Video Colab notebook and formats videos into the correct numpy representation for inference.
import av

def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.

    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.

    Returns:
        np.ndarray: np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

def sample_video(video_path, num_samples=8):
    container = av.open(video_path)

    # Sample num_samples frames uniformly from the video
    total_frames = container.streams.video[0].frames
    indices = np.arange(0, total_frames, total_frames / num_samples).astype(int)
    sampled_frames = read_video_pyav(container, indices)
    return sampled_frames

def process_videos_in_folder(folder_path):
    sample_info = {}

    # Supported video file extensions
    video_extensions = ('.mp4', '.avi', '.mov', '.mkv')

    for filename in os.listdir(folder_path):
        simple_video_name = os.path.splitext(os.path.basename(filename))[0]
        if filename.lower().endswith(video_extensions):
            video_path = os.path.join(folder_path, filename)
            try:
                sampled_clip = sample_video(video_path)
                sample_info[simple_video_name] = {"sampled_video": sampled_clip, "video_path": video_path}
            except Exception as e:
                print(f"Error processing {filename}: {str(e)}")

    return sample_info

sampled_video_info = process_videos_in_folder(split_video_dir)

# Get the video segment found in our Chroma query
video_segment = sampled_video_info[found_video_segment_filename]['sampled_video']
Setting up our Model
We'll load our model with 4-bit quantization to reduce memory usage and speed up inference.
from transformers import BitsAndBytesConfig, LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    "llava-hf/LLaVA-NeXT-Video-7B-hf",
    quantization_config=quantization_config,
    device_map='auto'
)

# To use later to play the videos in the notebook itself
from matplotlib import pyplot as plt
from matplotlib import animation
from IPython.display import HTML

# np array with shape (frames, height, width, channels)
# pick one at random just to see
video = sampled_video_info[list(sampled_video_info.keys())[0]]['sampled_video']

fig = plt.figure()
im = plt.imshow(video[0, :, :, :])
plt.close()  # this is required to not display the generated image

def init():
    im.set_data(video[0, :, :, :])

def animate(i):
    im.set_data(video[i, :, :, :])
    return im

anim = animation.FuncAnimation(fig, animate, init_func=init, frames=video.shape[0], interval=100)
HTML(anim.to_html5_video())
Running the Model
Now that we have our query and the relevant video, we can feed them into the model to get an output.
# Each "content" is a list of dicts and you can add image/video/text modalities
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": query},
            {"type": "video"},
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
prompt_len = len(prompt)

inputs = processor([prompt], videos=[video_segment], padding=True, return_tensors="pt").to(model.device)

generate_kwargs = {"max_new_tokens": 100, "do_sample": True, "top_p": 0.9}
output = model.generate(**inputs, **generate_kwargs)
open_source_segment_generated_text = processor.batch_decode(output, skip_special_tokens=True)
print(open_source_segment_generated_text[0])
10 - Compare Pegasus to the Open Source model
As we can see, Pegasus does a better job of answering our query (both answers are printed again for convenience).
print(f"query {query}")
print("pegasus answer")
print(segment_answer)
print("open source answer")
print(open_source_segment_generated_text[0])
11 - Using Chroma and TwelveLabs Embeddings to Search Multiple Videos
We have already shown how TwelveLabs Marengo and Chroma can be used to find relevant segments within a single video.
Now we can show the true power of Marengo and Chroma for a RAG use case: querying a large set of candidate videos to find the specific video we need to answer the query.
Embedding our Video Database
First we will embed all of our videos and store those embeddings in Chroma:
# Embed and store task IDs for all videos
chroma_collection_name = "video_embeddings"
collection = chroma_client.get_or_create_collection(chroma_collection_name)
Then, we create a dictionary to store task_ids for each video in case we need to retrieve these embeddings from TwelveLabs in the future.
# Store Twelve Labs task ids for each video
task_ids = {}
Next, we get the embeddings for each video using the create_video_embeddings function we created earlier and upload them to our Chroma collection.
# Get embeddings and metadata for each video
# Store task ids so we don't upload videos multiple times
for filename in os.listdir(upscaled_video_dir):
    if filename.endswith(".mp4"):
        if filename in task_ids.keys():
            task_id = task_ids[filename]
        else:
            task_id = None

        file_path = os.path.join(upscaled_video_dir, filename)
        ids, metadatas, embeddings, task_id = create_video_embeddings(
            twelvelabs_client, file_path, segment_duration, task_id
        )
        task_ids[filename] = task_id

        collection.add(
            metadatas=metadatas,
            embeddings=embeddings,
            ids=ids
        )
print(task_ids)
# Output {'How To Make Birria Tacos [4nIFJFgH99w].mp4': '67211a5430c547875065b3f7', "How To Make a McDonald's Cheeseburger [SvOx7tA_Cv8].mp4": '67211ab56025850d8c890d93', 'How To Make Potato Wedges [eZXbMWPJkKQ]
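Since the Chroma client here is ephemeral and task_ids lives only in memory, you may want to persist this mapping so a later session can reuse the same embeddings without re-running Marengo. A small optional sketch using the standard library (the filename is arbitrary):

import json

# Save the video -> task_id mapping for future sessions
with open("task_ids.json", "w") as f:
    json.dump(task_ids, f)

# Later, load it back before running the embedding loop above
# with open("task_ids.json") as f:
#     task_ids = json.load(f)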
Querying our Database
Here we use the same query as before to find the full video that will answer our query.
response = query_chroma(collection, query)

found_full_video_name = response["metadatas"][0][0]["video_name"]
print(found_full_video_name)
# Output
How To Make Birria Tacos [4nIFJFgH99w]
12 - Use Pegasus to Chat with a Full Video
We already have an index created, so we just need to upload the videos to this index and then call Pegasus.
# Store Pegasus video ids so that we don't upload videos multiple times
pegasus_video_ids = {}
We upload each video here and store the video ids we will use to chat with each video.
for upscaled_video in os.listdir(upscaled_video_dir):
    upscaled_video_path = os.path.join(upscaled_video_dir, upscaled_video)
    print(upscaled_video_path)
    if upscaled_video not in pegasus_video_ids:
        video_id = upload_video_to_twelve_labs(index, upscaled_video_path)
        pegasus_video_ids[upscaled_video] = video_id
# Output ../videos/upscaled_videos/How To Make Birria Tacos [4nIFJFgH99w].mp4 Task created: id=67211b6b30c547875065b407 status=validating Status=pending ... Status=indexing ... Status=ready The unique identifier of your video is 67211b6c1a5f2afde113615e. ../videos/upscaled_videos/How To Make a McDonald's Cheeseburger [SvOx7tA_Cv8].mp4 Task created: id=67211c446025850d8c890da5 status=validating Status=pending ... Status=indexing ... Status=ready The unique identifier of your video is 67211c451a5f2afde113615f. ../videos/upscaled_videos/How To Make Potato Wedges [eZXbMWPJkKQ]
print(pegasus_video_ids)
# Output {'How To Make Birria Tacos [4nIFJFgH99w].mp4': '67211b6c1a5f2afde113615e', "How To Make a McDonald's Cheeseburger [SvOx7tA_Cv8].mp4": '67211c451a5f2afde113615f', 'How To Make Potato Wedges [eZXbMWPJkKQ]
Calling Pegasus to Chat with Full Video
First, we find the video_id associated with the video that matched our Chroma query.
video_id = pegasus_video_ids[found_full_video_name]
print(video_id)
Next, we ask Pegasus what the ingredients are for birria tacos. The query is exactly the same as before, but this time we are chatting with the entire video instead of just a 6-second segment.
res = twelvelabs_client.generate.text(
    video_id=video_id,
    prompt=query
)
full_video_answer = res.data

print(f"query {query}")
print(f"{full_video_answer}")
Compare Full Video Answer to Segment Answer
We can then compare this to the answer given when we chatted with the 6-second segment.
print(f"segment answer: \n{segment_answer}")
13 - Use an Open Source Model to Chat with a Full Video
Now we will compare Pegasus to the open source model when chatting with a full video.
After we sample all of the full videos, we can run our model on the video our Chroma query returned and compare its answer.
# Sample all of the full videos
sampled_database_video_info = process_videos_in_folder(upscaled_video_dir)

# Get the sampled frames for the full video found in our Chroma query
# (process_videos_in_folder keys videos by filename without the extension)
found_full_video_key = os.path.splitext(found_full_video_name)[0]
full_video_sample = sampled_database_video_info[found_full_video_key]['sampled_video']

# Each "content" is a list of dicts and you can add image/video/text modalities
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": query},
            {"type": "video"},
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
prompt_len = len(prompt)

inputs = processor([prompt], videos=[full_video_sample], padding=True, return_tensors="pt").to(model.device)

generate_kwargs = {"max_new_tokens": 100, "do_sample": True, "top_p": 0.9}
output = model.generate(**inputs, **generate_kwargs)
generated_text = processor.batch_decode(output, skip_special_tokens=True)
print(generated_text[0])
Compare Result to Pegasus Answer
As we can see, the open source model cannot give us a useful answer when chatting with the entire video either, which shows the strength of Pegasus:
print(f"Pegasus answer: \n{full_video_answer}")
14 - Comparison
Comparing the two models, Pegasus clearly understands our video and gives an accurate response to our query. It recognizes that this is a birria taco tutorial and returns the ingredient list, whether it is given the entire video or just a segment.
The LLaVA-NeXT-Video open-source model understands our query but is not able to extract the relevant information from the video. It has some sense that this is a cooking video, yet it lacks the depth of understanding our query requires.
15 - Conclusion
This guide demonstrated how to interact with video content, either individually or as a complete set. It utilized the TwelveLabs Embed API and Chroma’s vector database for managing retrieval running locally on your machine.
We also compared TwelveLabs' Pegasus model with the LLaVA-NeXT-Video open-source model, evaluating the required infrastructure, developer experience, and query results. Compared to the open-source model, Pegasus showed promise with less operational overhead, better instruction following, and larger context lengths for easier querying of longer videos.
Appendix
For your reference and further exploration:
Complete Colab Notebook
Chroma’s Client and Documentation