Tutorial

Building a Multimodal Retrieval-Augmented Generation Application with Twelve Labs and Milvus

Hrishikesh Yadav

Discover your perfect style with Fashion AI Assistant! This application combines the power of visual search, conversational AI, and video understanding to innovate how you explore and find fashion. Whether you're chatting about your style preferences or sharing a photo of an outfit you want, the application helps you discover exactly what you're looking for ✨

Jan 22, 2025

13 Min

Introduction

Ever wondered what it would be like to have a personal fashion advisor available 24/7 who understands not just your words but also your visual preferences? Imagine showing a photo of an outfit you like and getting instant AI suggestions for similar styles, or describing your dream look and receiving matching outfit recommendations 👚

Our Fashion AI Assistant application is built for next-generation product discovery. This tutorial explores an advanced fashion recommendation system that combines multimodal retrieval with conversational AI. The system uses the TwelveLabs Embed API with the Marengo-retrieval-2.7 model to generate embeddings from both text and video content, enabling seamless search across different data types.

The application uses Milvus as its vector database for efficient similarity search and integrates OpenAI's gpt-3.5 for natural language processing. It handles fashion queries through both text and image inputs, finding relevant product descriptions and precise video timestamps to show users exactly where their desired items appear.

This powerful combination creates a system that understands both textual and visual queries while delivering contextualized responses enriched with multimodal content.

Let's explore how this application works and how you can build similar solutions using the TwelveLabs Python SDK and Milvus Python SDK.

You can explore the demo of the application here: Fashion AI Chat App.

If you want to access the code and experiment with the app directly, you can use this Replit Template.

Prerequisites

To follow along, you'll need a Twelve Labs API key, a Milvus vector database instance (this tutorial uses Zilliz Cloud), an OpenAI API key, and a working Python environment.

How the Application Works

Here's an overview of how the application works and how its components interact:

[Figure: overview of the application architecture and how its components interact]

The application has three main features:

  1. Embedding Generation and Insertion into the Milvus Vector Database

  2. Query Image to Video Segment Retrieval

  3. Chat Retrieval Augmented Generation (RAG) for Multimodal Retrieval (Text and Video Segments)

This tutorial application shows how to store product catalog data in a vector database using the Twelve Labs Embed API (Marengo-retrieval-2.7). Embeddings for both text and video content are stored in the same Milvus collection. For clarity and reusability, the Twelve Labs embedding logic is defined within each utility function.

For the application to work as expected, you must insert the embedding data into the collection; alternatively, you can refer to the sample data used in this application, provided here.

Preparation Steps

  1. Obtain your API key from the Twelve Labs Playground and set up your environment variable.

  2. Clone the project from GitHub or use the Replit Template. We recommend the Replit template since it comes with a predefined virtual environment. Remember to add the secrets file, and consult the Replit secrets docs during setup.

  3. Create a .env file containing your Twelve Labs, Milvus Connection, and OpenAI API keys for the LLM model.

For the Milvus vector database, credentials are configured using Zilliz Cloud, though other connection methods are available in the Installation Guide.

TWELVELABS_API_KEY="your_twelvelabs_key"
COLLECTION_NAME="your_collection_name"
URL="your_milvus_url"
TOKEN="your_milvus_token"
OPENAI_API_KEY="your_openai_key"

Once you've completed these steps, you're ready to start developing!
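
As a quick sanity check before you start, here's a minimal sketch that loads the .env file and verifies all keys are present. It assumes the python-dotenv package; if you export the variables another way, skip the load_dotenv call.

# Minimal sketch: load the .env file and confirm all keys are set.
# Assumes python-dotenv; adjust if you manage environment variables differently.
import os
from dotenv import load_dotenv

load_dotenv()

required = ["TWELVELABS_API_KEY", "COLLECTION_NAME", "URL", "TOKEN", "OPENAI_API_KEY"]
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
print("Environment configured correctly.")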

Walkthrough for Fashion Chat Application

In this tutorial, we'll create a Streamlit application with a minimal frontend. Let's look at the directory structure:

├── pages/
│   ├── add_product_page.py
│   └── visual_search.py
├── .gitignore
├── app.py
├── utils.py
└── requirements.txt

Creating the Streamlit Application

Now that you've finished the setup, let's build the Streamlit application.

Find all required dependencies for creating your virtual environment here: requirements.txt

Create a virtual Python environment, then set up the environment for the application using the following command:

pip install -r requirements.txt
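
If you're assembling requirements.txt yourself, the core packages this tutorial relies on (based on the imports used here; versions omitted, pin them as needed) are roughly:

streamlit
twelvelabs
pymilvus
openai
python-dotenv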

Defining the Utility Functions

This section describes utils.py, which contains the core logic and implementation. Let's examine each subsection:

1 - Setting up the Connection

The first step loads the environment variables and establishes the connections: OpenAI for LLM access, Milvus for collection management, and the Twelve Labs client for embedding generation.

# Imports used across utils.py (uuid and streamlit are used by the utility functions below)
import os
import uuid

import streamlit as st
from openai import OpenAI
from pymilvus import connections, Collection
from twelvelabs import TwelveLabs

# Load environment variables
COLLECTION_NAME = os.getenv('COLLECTION_NAME')
URL = os.getenv('URL')
TOKEN = os.getenv('TOKEN')
TWELVELABS_API_KEY = os.getenv('TWELVELABS_API_KEY')

# Initialize connections
openai_client = OpenAI()
connections.connect(uri=URL, token=TOKEN)
collection = Collection(COLLECTION_NAME)
collection.load()

twelvelabs_client = TwelveLabs(api_key=TWELVELABS_API_KEY)

2 - Generating Embeddings with the Twelve Labs Embed API

This section covers embedding generation using the Twelve Labs SDK with Marengo-retrieval-2.7 for multimodal retrieval. The solution integrates text and video processing to create efficient search embeddings.

# Generate text and segmented video embeddings for a product
def generate_embedding(product_info):
    try:
        st.write("Starting embedding generation process...")
        st.write(f"Processing product: {product_info['title']}")
        
        # Initialize the Twelve Labs client
        twelvelabs_client = TwelveLabs(api_key=TWELVELABS_API_KEY)
        st.write("TwelveLabs client initialized successfully")
        
        st.write("Attempting to generate text embedding...")

        # Formatting the Text Data of Product Catalogue
        text = f"product type: {product_info['title']}. " \
               f"product description: {product_info['desc']}. " \
               f"product category: fashion apparel."
               
        st.write(f"Generating embedding for text: {text}")
        
        # Generating the Text Embeddings
        text_embedding = twelvelabs_client.embed.create(
            model_name="Marengo-retrieval-2.7",
            text=text
        ).text_embedding.segments[0].embeddings_float
        st.write("Text embedding generated successfully")
        
        # Create and wait for video embedding task
        st.write("Creating video embedding task...")
        video_task = twelvelabs_client.embed.task.create(
            model_name="Marengo-retrieval-2.7",
            video_url=product_info['video_url'],
            video_clip_length=6
        )
        
        def on_task_update(task):
            st.write(f"Video processing status: {task.status}")
        
        st.write("Waiting for video processing to complete...")
        video_task.wait_for_done(sleep_interval=2, callback=on_task_update)
        
        # Retrieve segmented video embeddings
        video_task = video_task.retrieve()
        if not video_task.video_embedding or not video_task.video_embedding.segments:
            raise Exception("Failed to retrieve video embeddings")
        
        video_segments = video_task.video_embedding.segments
        st.write(f"Retrieved {len(video_segments)} video segments")
        
        video_embeddings = []
        for segment in video_segments:
            video_embeddings.append({
                'embedding': segment.embeddings_float,
                'metadata': {
                    'scope': 'clip',
                    'start_time': segment.start_offset_sec,
                    'end_time': segment.end_offset_sec,
                    'video_url': product_info['video_url']
                }
            })
        
        return {
            'text_embedding': text_embedding,
            'video_embeddings': video_embeddings
        }, None
        
    except Exception as e:
        st.error("Error in embedding generation")
        st.error(f"Error message: {str(e)}")
        return None, str(e)

The process begins by structuring and formatting text to ensure consistent, meaningful embeddings. Video embedding follows, with status monitoring and progress tracking. Each video segment is set to 6 seconds (video_clip_length=6).

Metadata tracking is essential for maintaining traceability and enabling precise video segment and text searching. The results are then stored in the Milvus vector database collection.

3 - Embedding Insertion into the Milvus Vector Database

Vector databases efficiently store and retrieve high-dimensional embedding data. This section demonstrates how to insert both text and video embeddings into the Milvus vector database with proper metadata management and error handling.

# Insert text and all video segment embeddings into Milvus Collection
def insert_embeddings(embeddings_data, product_info):
    try:
        metadata = {
            "product_id": product_info['product_id'],
            "title": product_info['title'],
            "description": product_info['desc'],
            "video_url": product_info['video_url'],
            "link": product_info['link']
        }
        
        # Insert text embedding
        text_entry = {
            "id": int(uuid.uuid4().int & (1<<63)-1),
            "vector": embeddings_data['text_embedding'],
            "metadata": metadata,
            "embedding_type": "text"
        }
        collection.insert([text_entry])
        st.write("Text embedding inserted successfully")
        
        # Insert each video segment embedding
        for video_segment in embeddings_data['video_embeddings']:
            video_entry = {
                "id": int(uuid.uuid4().int & (1<<63)-1),
                "vector": video_segment['embedding'],
                "metadata": {**metadata, **video_segment['metadata']},
                "embedding_type": "video"
            }
            collection.insert([video_entry])
        
        st.write(f"Inserted {len(embeddings_data['video_embeddings'])} video segment embeddings")
        return True
        
    except Exception as e:
        st.error(f"Error inserting embeddings: {str(e)}")
        return False

After defining the id, vector, and metadata, an embedding_type field is set so that text and video embeddings can coexist in the same collection. The data is then inserted with the collection.insert method.
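
The tutorial assumes the collection already exists. If you need to create it yourself, here's a hedged sketch of a schema that matches the entries inserted above (an INT64 id, a float vector, a JSON metadata field, and a VARCHAR embedding_type). It assumes 1024-dimensional Marengo-retrieval-2.7 embeddings and an HNSW index with the COSINE metric; your actual index settings may differ.

# Sketch: create a Milvus collection compatible with the entries above.
# Assumptions: 1024-dim embeddings from Marengo-retrieval-2.7, HNSW + COSINE index.
import os
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType
)

connections.connect(uri=os.getenv('URL'), token=os.getenv('TOKEN'))

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=1024),
    FieldSchema(name="metadata", dtype=DataType.JSON),
    FieldSchema(name="embedding_type", dtype=DataType.VARCHAR, max_length=16),
]
schema = CollectionSchema(fields, description="Fashion product text and video embeddings")
collection = Collection(name=os.getenv('COLLECTION_NAME'), schema=schema)

# Index the vector field so collection.search can run efficiently
collection.create_index(
    field_name="vector",
    index_params={
        "index_type": "HNSW",
        "metric_type": "COSINE",
        "params": {"M": 16, "efConstruction": 200},
    },
)
collection.load()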

4 - Semantic Search Implementation for Image to Video Retrieval

Semantic search with image queries enables finding similar video content. This implementation combines Twelve Labs embedding capabilities with Milvus vector search. Here's how the utility function searched_similar_videos works:

[Figure: flow of the image query to video segment retrieval]

First, the system loads and reads the product image used to find video segments containing similar products. The Twelve Labs Embed API then converts the image into an embedding for semantic search against the stored video embeddings in the Milvus collection.

# Load and Read Query Image
image_path = "path/to/your/image.jpg" 
with open(image_path, 'rb') as f: 
    image_file = f.read()

# Initialize the client and generate the image embedding
twelvelabs_client = TwelveLabs(api_key=TWELVELABS_API_KEY)
image_embedding = twelvelabs_client.embed.create(
    model_name="Marengo-retrieval-2.7",
    image_file=image_file
).image_embedding.segments[0].embeddings_float

The search uses cosine similarity, with index search parameters (nprobe and ef) to balance speed and accuracy. When collection.search runs with the embedding_type == 'video' filter and the other parameters, it retrieves the specified number of product video segments.

search_params = {
    "metric_type": "COSINE",
    "params": {
        "nprobe": 1024,
        "ef": 64
    }
}

# Search relevant video segments
results = collection.search(
    data=[image_embedding],
    anns_field="vector",
    param=search_params,
    limit=2,
    expr="embedding_type == 'video'",
    output_fields=["metadata"]
)

The system processes the results, converts similarity scores to percentages for clarity, and organizes the metadata into a structured dictionary.

search_results = []
for hits in results:
    for hit in hits:
        metadata = hit.metadata
        # Convert score from [-1,1] to [0,100] range
        similarity = round((hit.score + 1) * 50, 2)
        similarity = max(0, min(100, similarity))
        
        search_results.append({
            'Title': metadata.get('title', ''),
            'Description': metadata.get('description', ''),
            'Link': metadata.get('link', ''),
            'Start Time': f"{metadata.get('start_time', 0):.1f}s",
            'End Time': f"{metadata.get('end_time', 0):.1f}s",
            'Video URL': metadata.get('video_url', ''),
            'Similarity': f"{similarity}%",
            'Raw Score': hit.score
        })

# Sort by similarity score in descending order
search_results.sort(key=lambda x: float(x['Similarity'].rstrip('%')), reverse=True)

The sorted results can then be displayed. The UI implementation is available in visual_search.py.
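
As a rough idea of how those sorted results could be rendered (the real layout lives in visual_search.py; this sketch only uses the dictionary keys built above):

# Sketch: render the sorted search results in Streamlit.
# Uses only the keys produced above; the actual UI in visual_search.py may differ.
import streamlit as st

for result in search_results:
    st.subheader(f"{result['Title']} ({result['Similarity']} match)")
    st.write(result['Description'])
    st.write(f"Segment: {result['Start Time']} - {result['End Time']}")
    if result['Video URL']:
        # Jump straight to the matched segment (start_time is in seconds)
        start_seconds = int(float(result['Start Time'].rstrip('s')))
        st.video(result['Video URL'], start_time=start_seconds)
    if result['Link']:
        st.markdown(f"[View product]({result['Link']})")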

5 - Developing the Retrieval-Augmented Generation System

This section focuses on question-answering in the fashion assistant context, combining multimodal retrieval with an LLM (gpt-3.5) for contextual responses. It emphasizes multimodal retrieval from text and video data, with metadata for effective personalization. The get_rag_response function implements the RAG functionality.

[Figure: RAG flow combining text and video segment retrieval with the LLM]

The process begins by converting the user's question and context into a text embedding with the Marengo-retrieval-2.7 model via the Twelve Labs Embed API, enabling semantic search across the database.

# Sample question
question_with_context = "fashion product: Suggest me black dresses for a party"

# Generate embedding for the question
question_embedding = twelvelabs_client.embed.create(
    model_name="Marengo-retrieval-2.7",
    text=question_with_context
).text_embedding.segments[0].embeddings_float

After setting semantic search parameters, the system performs separate collection searches for text and video segments. Video segment retrieval finds the most relevant segments matching the text query, providing users with comprehensive exploration options.

The expr parameter filters results by embedding type, while different limits apply to the text and video searches. A typical search retrieves the top two matching text products with their metadata and the top three relevant video segments.

# Configure search parameters
search_params = {
    "metric_type": "COSINE",
    "params": {
        "nprobe": 1024,
        "ef": 64
    }
}

# Search for relevant text embeddings
text_results = collection.search(
    data=[question_embedding],
    anns_field="vector",
    param=search_params,
    limit=2,  # Get top 2 text matches
    expr="embedding_type == 'text'",
    output_fields=["metadata"]
)

# Search for relevant video segments
video_results = collection.search(
    data=[question_embedding],
    anns_field="vector",
    param=search_params,
    limit=3,  # Get top 3 video segments
    expr="embedding_type == 'video'",
    output_fields=["metadata"]
)

During result processing, the system extracts metadata from each segment, converts similarity scores to percentages, and organizes product details in a structured dictionary.

# Process text results
text_docs = []
for hits in text_results:
    for hit in hits:
        metadata = hit.metadata
        similarity = round((hit.score + 1) * 50, 2)
        similarity = max(0, min(100, similarity))
        
        text_docs.append({
            "title": metadata.get('title', 'Untitled'),
            "description": metadata.get('description', 'No description available'),
            "product_id": metadata.get('product_id', ''),
            "video_url": metadata.get('video_url', ''),
            "link": metadata.get('link', ''),
            "similarity": similarity,
            "raw_score": hit.score,
            "type": "text"
        })

# Process video results
video_docs = []
for hits in video_results:
    for hit in hits:
        metadata = hit.metadata
        similarity = round((hit.score + 1) * 50, 2)
        similarity = max(0, min(100, similarity))
        
        video_docs.append({
            "title": metadata.get('title', 'Untitled'),
            "description": metadata.get('description', 'No description available'),
            "product_id": metadata.get('product_id', ''),
            "video_url": metadata.get('video_url', ''),
            "link": metadata.get('link', ''),
            "similarity": similarity,
            "raw_score": hit.score,
            "start_time": metadata.get('start_time', 0),
            "end_time": metadata.get('end_time', 0),
            "type": "video"
        })

After gathering relevant information, the system focuses on retrieval-augmented generation. The retrieved context is structured for efficient LLM processing.

The system defines both a System Prompt and a User Prompt. The System Prompt establishes the fashion assistant context, while the User Prompt combines the user's query with the retrieved database context. The system also handles the case where no matching context is retrieved.

# Handle case when no results found
if not text_docs and not video_docs:
    response_data = {
        "response": "I couldn't find any matching products. Try describing what you're looking for differently.",
        "metadata": None
    }
else:
    # Create context from text results for LLM
    text_context = "\n\n".join([
        f"Product: {doc['title']}\nDescription: {doc['description']}\nLink: {doc['link']}"
        for doc in text_docs
    ])

    # Prepare messages for OpenAI
    messages = [
        {
            "role": "system",
            "content": """You are a professional fashion advisor and AI shopping assistant.
            Organize your response in the following format:

            First, provide a brief, direct answer to the user's query
            Then, describe any relevant products found that match their request, including:
               - Product name and key features
               - Why this product matches their needs
               - Style suggestions for how to wear or use the item
            Finally, provide any additional style advice or recommendations
            
            Keep your response engaging and natural while maintaining this clear structure.
            Focus on being helpful and specific rather than promotional."""
        },
        {
            "role": "user",
            "content": f"""Query: {question}

Available Products:
{text_context}

Please provide fashion advice and product recommendations based on these options."""
        }
    ]

    # Get response from OpenAI
    chat_response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0.5,
        max_tokens=500
    )

    # Format final response
    response_data = {
        "response": chat_response.choices[0].message.content,
        "metadata": {
            "sources": text_docs + video_docs,
            "total_sources": len(text_docs) + len(video_docs),
            "text_sources": len(text_docs),
            "video_sources": len(video_docs)
        }
    }

The text_context variable holds the retrieved information. Using gpt-3.5-turbo, the system generates a response with explanations, suggestions, and user guidance as specified in the system prompt. The response is returned in an accessible dictionary structure. For UI implementation details, see app.py.
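
Tying it together, a stripped-down version of the chat flow might look like the sketch below. It assumes get_rag_response takes the user's question string and returns the response_data dictionary shown above; the full interface in app.py also manages chat history and video playback.

# Sketch of the chat flow in app.py; assumes get_rag_response(question) returns
# the response_data dictionary built above.
import streamlit as st
from utils import get_rag_response

st.title("Fashion AI Assistant")

question = st.chat_input("Ask about styles, outfits, or products...")
if question:
    with st.chat_message("user"):
        st.write(question)

    response_data = get_rag_response(question)

    with st.chat_message("assistant"):
        st.write(response_data["response"])
        metadata = response_data.get("metadata")
        if metadata:
            st.caption(
                f"Based on {metadata['total_sources']} retrieved sources "
                f"({metadata['text_sources']} text, {metadata['video_sources']} video)"
            )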

Demo Application

This demo has three main parts, each with a specific purpose: simplifying database insertion, enabling efficient multimodal retrieval, and improving search functionality for users.

  • Generation and insertion of text and video embeddings into the database

  • Retrieval of multimodal data through LLM chat interactions

  • Product image-based search for finding relevant video segments

The product data catalog includes fields such as product ID, title, description, link, and video URL. The text data is converted into embeddings, while the other fields are stored as metadata. For video segments, the system processes the video URL. All embeddings are then stored in a single collection.
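
For reference, a single catalog entry fed into the pipeline looks roughly like this (the field values are placeholders); generate_embedding and insert_embeddings handle the rest:

# Illustrative catalog entry; field names match what generate_embedding and
# insert_embeddings read from product_info. Values are placeholders.
product_info = {
    "product_id": "SKU-12345",
    "title": "Men's Black Crew-Neck T-Shirt",
    "desc": "Slim-fit cotton T-shirt with a classic crew neckline.",
    "link": "https://example.com/products/black-tshirt",
    "video_url": "https://example.com/videos/black-tshirt.mp4",
}

embeddings_data, error = generate_embedding(product_info)
if error is None:
    insert_embeddings(embeddings_data, product_info)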

[Screenshot: adding a product and inserting its text and video embeddings into the collection]

For multimodal retrieval, here's an example query: "I'm looking for a men's black T-shirt"—

[Screenshot: chat response with retrieved products and video segments for the query]

For semantic video search, when using a black t-shirt image as a query with a limit of top 2 results—

[Screenshot: top two matching video segments returned for the black T-shirt image query]

More Ideas to Experiment with the Tutorial

Understanding how an application works and how it was developed helps you create innovative products that serve users' needs. Here are some potential use cases to experiment with:

🛍️ E-commerce: Enhance product search and recommendations using text and image queries

🎵 Music Discovery: Find similar songs, artists, or genres based on audio clips and user preferences

🎥 Intelligent Video Search Engine: Find videos based on their visual and audio content, helping content creators, journalists, and researchers

🗺️ Personalized Travel Planner: Create travel itineraries based on user preferences, reviews, and destination data

📚 Educational Resource Management: Organize and find learning materials—including documents, presentations, and videos—based on content needs

Conclusion

This tutorial demonstrates how multimodal retrieval works through embeddings, powered by Twelve Labs and the Milvus vector database. The implementation shows developers how to build scalable applications that handle both visual and textual data. Thanks for following along—we welcome your ideas for improving user experience and solving various challenges.

Additional Resources

Learn more about the embedding generation engine—Marengo-retrieval-2.7. To explore Twelve Labs further and deepen your understanding of video content analysis, check out these resources:

  • Discord Community: Join our vibrant community of developers and enthusiasts to discuss ideas, ask questions, and share your projects. Join the Twelve Labs Discord

  • Sample Applications: Explore a variety of sample applications to inspire your next project or learn new implementation techniques

  • Explore Tutorials: Dive deeper into Twelve Labs capabilities with our comprehensive tutorials

We encourage you to leverage these resources to expand your knowledge and create innovative applications using Twelve Labs video understanding technology.

Introduction

Ever wondered what it would be like to have a personal fashion advisor available 24/7 who understands not just your words but also your visual preferences? Imagine showing a photo of an outfit you like and getting instant AI suggestions for similar styles, or describing your dream look and receiving matching outfit recommendations 👚

Our Fashion AI assistant application is built for next-generation product discovery. This tutorial explores our advanced fashion recommendation system that combines multimodal retrieval with conversational AI. The system uses TwelveLabs Embed - Marengo-retrieval-2.7 model to generate embeddings from both text and video content, enabling seamless search across different data types.

The application uses Milvus as its vector database for efficient similarity search and integrates OpenAI's gpt-3.5 for natural language processing. It handles fashion queries through both text and image inputs, finding relevant product descriptions and precise video timestamps to show users exactly where their desired items appear.

This powerful combination creates a system that understands both textual and visual queries while delivering contextualized responses enriched with multimodal content.

Let's explore how this application works and how you can build similar solutions using the TwelveLabs Python SDK and Milvus Python SDK.

You can explore the demo of the application here: Fashion AI Chat App.

If you want to access the code and experiment with the app directly, you can use this Replit Template.

Prerequisites

Working of the Application

Here's an overview of how the application works and how its components interact:

__wf_reserved_inherit

The application has three main features:

  1. Embedding Generation and Insertion into the Milvus Vector Database

  2. Query Image to Video Segment Retrieval

  3. Chat Retrieval Augmented Generation (RAG) for Multimodal Retrieval (Text and Video Segments)

This tutorial application shows how to store product catalog data in a vector database using Twelve Labs Embed - Marengo-retrieval-2.7. The embeddings for both text and video content are stored in the same Milvus vector database collection. For clarity and reusability, the Twelve Labs embedding function is defined in each utility function.

To make the application work as expected, you must insert the embedding data into the collection, or you can refer to the sample data provided here used in this application.

Preparation Steps

  1. Obtain your API key from the Twelve Labs Playground and set up your environment variable.

  2. Clone the project from Github or use the Replit Template. We recommend using the Replit template since it comes with a predefined virtual environment. Remember to add the secret file and consult the secret key replit docs during setup.

  3. Create a .env file containing your Twelve Labs, Milvus Connection, and OpenAI API keys for the LLM model.

For the Milvus vector database, credentials are configured using Zilliz Cloud, though other connection methods are available in the Installation Guide.

TWELVELABS_API_KEY="your_twelvelabs_key"
COLLECTION_NAME="your_collection_name"
URL="your_milvus_url"
TOKEN="your_milvus_token"
OPENAI_API_KEY="your_openai_key"

Once you've completed these steps, you're ready to start developing!

Walkthrough for Fashion Chat Application

In this tutorial, we'll create a Streamlit application with a minimal frontend. Let's look at the directory structure:

├── pages/
├── add_product_page.py
└── visual_search.py
├── .gitignore
├── app.py
└── utils.py  
└── requirements.txt

Creating the Streamlit Application

Now that you've finished the setup, let's build the Streamlit application.

Find all required dependencies for creating your virtual environment here: requirements.txt

Create a virtual Python environment, then set up the environment for the application using the following command:

pip install -r requirements.txt

Defining the Utility function

This section describes the utility.py, which contains the core logic and implementation. Let's examine each subsection:

1 - Setting up the Connection

The first step loads environment variables and establishes connections with OpenAI for LLM access, Milvus for collection management, and Twelve Labs initialization.

# Load environment variables
COLLECTION_NAME = os.getenv('COLLECTION_NAME')
URL = os.getenv('URL')
TOKEN = os.getenv('TOKEN')
TWELVELABS_API_KEY = os.getenv('TWELVELABS_API_KEY')

# Initialize connections
openai_client = OpenAI()
connections.connect(uri=URL, token=TOKEN)
collection = Collection(COLLECTION_NAME)
collection.load()

twelvelabs_client = TwelveLabs(api_key=TWELVELABS_API_KEY)

2 - Generation of Embedding with Twelve Labs Embed API

This section covers embedding generation using the Twelve Labs SDK with Marengo-retrieval-2.7 for multimodal retrieval. The solution integrates text and video processing to create efficient search embeddings.

# Generate text and segmented video embeddings for a product
def generate_embedding(product_info):
    try:
        st.write("Starting embedding generation process...")
        st.write(f"Processing product: {product_info['title']}")
        
        # Initialize the Twelve Labs client
 twelvelabs_client = TwelveLabs(api_key=TWELVELABS_API_KEY)
        st.write("TwelveLabs client initialized successfully")
        
        st.write("Attempting to generate text embedding...")

        # Formatting the Text Data of Product Catalogue
        text = f"product type: {product_info['title']}. " \
               f"product description: {product_info['desc']}. " \
               f"product category: fashion apparel."
               
        st.write(f"Generating embedding for text: {text}")
        
        # Generating the Text Embeddings
        text_embedding = twelvelabs_client.embed.create(
            model_name="Marengo-retrieval-2.7",
            text=text
        ).text_embedding.segments[0].embeddings_float
        st.write("Text embedding generated successfully")
        
        # Create and wait for video embedding task
        st.write("Creating video embedding task...")
        video_task = twelvelabs_client.embed.task.create(
            model_name="Marengo-retrieval-2.7",
            video_url=product_info['video_url'],
            video_clip_length=6
        )
        
        def on_task_update(task):
            st.write(f"Video processing status: {task.status}")
        
        st.write("Waiting for video processing to complete...")
        video_task.wait_for_done(sleep_interval=2, callback=on_task_update)
        
        # Retrieve segmented video embeddings
        video_task = video_task.retrieve()
        if not video_task.video_embedding or not video_task.video_embedding.segments:
            raise Exception("Failed to retrieve video embeddings")
        
        video_segments = video_task.video_embedding.segments
        st.write(f"Retrieved {len(video_segments)} video segments")
        
        video_embeddings = []
        for segment in video_segments:
            video_embeddings.append({
                'embedding': segment.embeddings_float,
                'metadata': {
                    'scope': 'clip',
                    'start_time': segment.start_offset_sec,
                    'end_time': segment.end_offset_sec,
                    'video_url': product_info['video_url']
                }
            })
        
        return {
            'text_embedding': text_embedding,
            'video_embeddings': video_embeddings
        }, None
        
    except Exception as e:
        st.error("Error in embedding generation")
        st.error(f"Error message: {str(e)}")
        return None, str(e)

The process begins by structuring and formatting text to ensure consistent, meaningful embeddings. Video embedding follows, with status monitoring and progress tracking. Each video segment is set to 6 seconds (video_clip_length=6).

Metadata tracking is essential for maintaining traceability and enabling precise video segment and text searching. The results are then stored in the Milvus vector database collection.

3 - Embedding Insertion into the Milvus Vector Database

Vector databases efficiently store and retrieve high-dimensional embedding data. This section demonstrates how to insert both text and video embeddings into the Milvus vector database with proper metadata management and error handling.

# Insert text and all video segment embeddings into Milvus Collection
def insert_embeddings(embeddings_data, product_info):
    try:
        metadata = {
            "product_id": product_info['product_id'],
            "title": product_info['title'],
            "description": product_info['desc'],
            "video_url": product_info['video_url'],
            "link": product_info['link']
        }
        
        # Insert text embedding
        text_entry = {
            "id": int(uuid.uuid4().int & (1<<63)-1),
            "vector": embeddings_data['text_embedding'],
            "metadata": metadata,
            "embedding_type": "text"
        }
        collection.insert([text_entry])
        st.write("Text embedding inserted successfully")
        
        # Insert each video segment embedding
        for video_segment in embeddings_data['video_embeddings']:
            video_entry = {
                "id": int(uuid.uuid4().int & (1<<63)-1),
                "vector": video_segment['embedding'],
                "metadata": {**metadata, **video_segment['metadata']},
                "embedding_type": "video"
            }
            collection.insert([video_entry])
        
        st.write(f"Inserted {len(embeddings_data['video_embeddings'])} video segment embeddings")
        return True
        
    except Exception as e:
        st.error(f"Error inserting embeddings: {str(e)}")
        return False

After defining the id, vector, and metadata, the embedding_type is specified to accommodate different formats in the same collection space. The data is then inserted using the collection.insert method.

4 - Semantic Search Implementation for Image to Video Retrieval

Semantic search with image queries enables finding similar video content. This implementation combines Twelve Labs embedding capabilities with Milvus vector search. Here's how the utility function searched_similar_videos works:

__wf_reserved_inherit

First, the system loads and reads the product image to find video segments containing similar products. Twelve Labs embed then converts the image into an embedding for semantic search against stored video embeddings in the Milvus collection.

# Load and Read Query Image
image_path = "path/to/your/image.jpg" 
with open(image_path, 'rb') as f: 
    image_file = f.read()

# Intializing of client and Generation of Image Emebedding
twelvelabs_client = TwelveLabs(api_key=TWELVELABS_API_KEY)
image_embedding = twelvelabs_client.embed.create(
    model_name="Marengo-retrieval-2.7",
    image_file=image_file
).image_embedding.segments[0].embeddings_float

The search uses cosine similarity and cluster parameters to balance speed and accuracy. When collection.search runs with embedding_type == 'video' and other parameters, it retrieves the specified number of product video segments.

search_params = {
    "metric_type": "COSINE",
    "params": {
        "nprobe": 1024,
        "ef": 64
    }
}

# Search relevant video segments
results = collection.search(
    data=[image_embedding],
    anns_field="vector",
    param=search_params,
    limit=2,
    expr="embedding_type == 'video'",
    output_fields=["metadata"]
)

The system processes the results, converts similarity scores to percentages for clarity, and organizes the metadata into a structured dictionary.

search_results = []
for hits in results:
    for hit in hits:
        metadata = hit.metadata
        # Convert score from [-1,1] to [0,100] range
        similarity = round((hit.score + 1) * 50, 2)
        similarity = max(0, min(100, similarity))
        
        search_results.append({
            'Title': metadata.get('title', ''),
            'Description': metadata.get('description', ''),
            'Link': metadata.get('link', ''),
            'Start Time': f"{metadata.get('start_time', 0):.1f}s",
            'End Time': f"{metadata.get('end_time', 0):.1f}s",
            'Video URL': metadata.get('video_url', ''),
            'Similarity': f"{similarity}%",
            'Raw Score': hit.score
        })

# Sort by similarity score in descending order
search_results.sort(key=lambda x: float(x['Similarity'].rstrip('%')), reverse=True)

The sorted results can then be displayed. The UI implementation is available in vision_search.

5 - Developing Retrieval Augmented Generation System

This section focuses on question-answering in the fashion assistant context, combining multimodal retrieval with LLM (gpt-3.5) for contextual responses. It emphasizes multimodal retrieval from text and video data with metadata for effective personalization. The get_rag_response function implements the RAG functionality.

__wf_reserved_inherit

The process begins by converting the user's question and context into a text embedding using the Marengo-retrieval-2.7 engine with Twelve Labs embed, enabling semantic search across the database.

# Sample question
question_with_context = f"fashion product: Suggest me black dresses for a party"

# Generate embedding for the question
question_embedding = twelvelabs_client.embed.create(
    model_name="Marengo-retrieval-2.7",
    text=question_with_context
).text_embedding.segments[0].embeddings_float

After setting semantic search parameters, the system performs separate collection searches for text and video segments. Video segment retrieval finds the most relevant segments matching the text query, providing users with comprehensive exploration options.

The expr method refines search factors, while different limits apply to text and video searches. A typical search retrieves two text products with metadata and three relevant video segments.

# Configure search parameters
search_params = {
    "metric_type": "COSINE",
    "params": {
        "nprobe": 1024,
        "ef": 64
    }
}

# Search for relevant text embeddings
text_results = collection.search(
    data=[question_embedding],
    anns_field="vector",
    param=search_params,
    limit=2,  # Get top 2 text matches
    expr="embedding_type == 'text'",
    output_fields=["metadata"]
)

# Search for relevant video segments
video_results = collection.search(
    data=[question_embedding],
    anns_field="vector",
    param=search_params,
    limit=3,  # Get top 3 video segments
    expr="embedding_type == 'video'",
    output_fields=["metadata"]
)

During result processing, the system extracts metadata from each segment, converts similarity scores to percentages, and organizes product details in a structured dictionary.

# Process text results
text_docs = []
for hits in text_results:
    for hit in hits:
        metadata = hit.metadata
        similarity = round((hit.score + 1) * 50, 2)
        similarity = max(0, min(100, similarity))
        
        text_docs.append({
            "title": metadata.get('title', 'Untitled'),
            "description": metadata.get('description', 'No description available'),
            "product_id": metadata.get('product_id', ''),
            "video_url": metadata.get('video_url', ''),
            "link": metadata.get('link', ''),
            "similarity": similarity,
            "raw_score": hit.score,
            "type": "text"
        })

# Process video results
video_docs = []
for hits in video_results:
    for hit in hits:
        metadata = hit.metadata
        similarity = round((hit.score + 1) * 50, 2)
        similarity = max(0, min(100, similarity))
        
        video_docs.append({
            "title": metadata.get('title', 'Untitled'),
            "description": metadata.get('description', 'No description available'),
            "product_id": metadata.get('product_id', ''),
            "video_url": metadata.get('video_url', ''),
            "link": metadata.get('link', ''),
            "similarity": similarity,
            "raw_score": hit.score,
            "start_time": metadata.get('start_time', 0),
            "end_time": metadata.get('end_time', 0),
            "type": "video"
        })

After gathering relevant information, the system focuses on retrieval-augmented generation. The retrieved context is structured for efficient LLM processing.

The system defines both System and User Prompts. The System Prompt establishes the fashion assistant context, while the User Prompt combines the user's query with retrieved database context. The system includes error handling for cases where context retrieval fails.

# Handle case when no results found
if not text_docs and not video_docs:
    response_data = {
        "response": "I couldn't find any matching products. Try describing what you're looking for differently.",
        "metadata": None
    }
else:
    # Create context from text results for LLM
    text_context = "\n\n".join([
        f"Product: {doc['title']}\nDescription: {doc['description']}\nLink: {doc['link']}"
        for doc in text_docs
    ])

    # Prepare messages for OpenAI
    messages = [
        {
            "role": "system",
            "content": """You are a professional fashion advisor and AI shopping assistant.
            Organize your response in the following format:

            First, provide a brief, direct answer to the user's query
            Then, describe any relevant products found that match their request, including:
               - Product name and key features
               - Why this product matches their needs
               - Style suggestions for how to wear or use the item
            Finally, provide any additional style advice or recommendations
            
            Keep your response engaging and natural while maintaining this clear structure.
            Focus on being helpful and specific rather than promotional."""
        },
        {
            "role": "user",
            "content": f"""Query: {question}

Available Products:
{text_context}

Please provide fashion advice and product recommendations based on these options."""
        }
    ]

    # Get response from OpenAI
    chat_response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0.5,
        max_tokens=500
    )

    # Format final response
    response_data = {
        "response": chat_response.choices[0].message.content,
        "metadata": {
            "sources": text_docs + video_docs,
            "total_sources": len(text_docs) + len(video_docs),
            "text_sources": len(text_docs),
            "video_sources": len(video_docs)
        }
    }

The text_context contains the retrieved information. Using gpt-3.5-turbo, the system generates responses with explanations, suggestions, and user guidance as specified in the system prompt. The response is formatted in an accessible dictionary structure. For UI implementation details, see the app.

Demo Application

This demo has three main parts, each with a specific purpose: simplifying database insertion, enabling efficient multimodal retrieval, and improving search functionality for users.

  • Generation and insertion of text and video embeddings into the database

  • Retrieval of multimodal data through LLM chat interactions

  • Product image-based search for finding relevant video segments

The product data catalog includes fields such as product ID, text, description, link, and video URL. Text data is converted into embeddings, while other data is stored as metadata. For video segments, the system processes the video URL. All embeddings are then stored in a single collection.

__wf_reserved_inherit

For multimodal retrieval, here's an example query: "I'm looking for a men's black T-shirt"—

__wf_reserved_inherit

For semantic video search, when using a black t-shirt image as a query with a limit of top 2 results—

__wf_reserved_inherit

More Ideas to Experiment with the Tutorial

Understanding an application's workings and development process helps you create innovative products that serve users' needs. Here are some potential use cases for video content creators:

🛍️ E-commerce: Enhance product search and recommendations using text and image queries

🎵 Music Discovery: Find similar songs, artists, or genres based on audio clips and user preferences

🎥 Intelligent Video Search Engine: Find videos based on their visual and audio content, helping content creators, journalists, and researchers

🗺️ Personalized Travel Planner: Create travel itineraries based on user preferences, reviews, and destination data

📚Educational Resource Management: Organize and find learning materials—including documents, presentations, and videos—based on content needs

Conclusion

This tutorial demonstrates how multimodal retrieval works through embeddings, powered by Twelve Labs and Milvus vector database. This implementation shows developers how to build scalable applications that handle both visual and textual data. Thanks for following along—we welcome your ideas for improving user experience and solving various challenges.

Additional Resources

Learn more about the embedding generation engine—Marengo-retrieval-2.7. To explore Twelve Labs further and deepen your understanding of video content analysis, check out these resources:

  • Discord Community: Join our vibrant community of developers and enthusiasts to discuss ideas, ask questions, and share your projects. Join the Twelve Labs Discord

  • Sample Applications: Explore a variety of sample applications to inspire your next project or learn new implementation techniques

  • Explore Tutorials: Dive deeper into Twelve Labs capabilities with our comprehensive tutorials

We encourage you to leverage these resources to expand your knowledge and create innovative applications using Twelve Labs video understanding technology.

Introduction

Ever wondered what it would be like to have a personal fashion advisor available 24/7 who understands not just your words but also your visual preferences? Imagine showing a photo of an outfit you like and getting instant AI suggestions for similar styles, or describing your dream look and receiving matching outfit recommendations 👚

Our Fashion AI assistant application is built for next-generation product discovery. This tutorial explores our advanced fashion recommendation system that combines multimodal retrieval with conversational AI. The system uses TwelveLabs Embed - Marengo-retrieval-2.7 model to generate embeddings from both text and video content, enabling seamless search across different data types.

The application uses Milvus as its vector database for efficient similarity search and integrates OpenAI's gpt-3.5 for natural language processing. It handles fashion queries through both text and image inputs, finding relevant product descriptions and precise video timestamps to show users exactly where their desired items appear.

This powerful combination creates a system that understands both textual and visual queries while delivering contextualized responses enriched with multimodal content.

Let's explore how this application works and how you can build similar solutions using the TwelveLabs Python SDK and Milvus Python SDK.

You can explore the demo of the application here: Fashion AI Chat App.

If you want to access the code and experiment with the app directly, you can use this Replit Template.

Prerequisites

Working of the Application

Here's an overview of how the application works and how its components interact:

__wf_reserved_inherit

The application has three main features:

  1. Embedding Generation and Insertion into the Milvus Vector Database

  2. Query Image to Video Segment Retrieval

  3. Chat Retrieval Augmented Generation (RAG) for Multimodal Retrieval (Text and Video Segments)

This tutorial application shows how to store product catalog data in a vector database using Twelve Labs Embed - Marengo-retrieval-2.7. The embeddings for both text and video content are stored in the same Milvus vector database collection. For clarity and reusability, the Twelve Labs embedding function is defined in each utility function.

To make the application work as expected, you must insert the embedding data into the collection, or you can refer to the sample data provided here used in this application.

Preparation Steps

  1. Obtain your API key from the Twelve Labs Playground and set up your environment variable.

  2. Clone the project from Github or use the Replit Template. We recommend using the Replit template since it comes with a predefined virtual environment. Remember to add the secret file and consult the secret key replit docs during setup.

  3. Create a .env file containing your Twelve Labs, Milvus Connection, and OpenAI API keys for the LLM model.

For the Milvus vector database, credentials are configured using Zilliz Cloud, though other connection methods are available in the Installation Guide.

TWELVELABS_API_KEY="your_twelvelabs_key"
COLLECTION_NAME="your_collection_name"
URL="your_milvus_url"
TOKEN="your_milvus_token"
OPENAI_API_KEY="your_openai_key"

Once you've completed these steps, you're ready to start developing!

Walkthrough for Fashion Chat Application

In this tutorial, we'll create a Streamlit application with a minimal frontend. Let's look at the directory structure:

├── pages/
├── add_product_page.py
└── visual_search.py
├── .gitignore
├── app.py
└── utils.py  
└── requirements.txt

Creating the Streamlit Application

Now that you've finished the setup, let's build the Streamlit application.

Find all required dependencies for creating your virtual environment here: requirements.txt

Create a virtual Python environment, then set up the environment for the application using the following command:

pip install -r requirements.txt

Defining the Utility function

This section describes the utility.py, which contains the core logic and implementation. Let's examine each subsection:

1 - Setting up the Connection

The first step loads environment variables and establishes connections with OpenAI for LLM access, Milvus for collection management, and Twelve Labs initialization.

# Load environment variables
COLLECTION_NAME = os.getenv('COLLECTION_NAME')
URL = os.getenv('URL')
TOKEN = os.getenv('TOKEN')
TWELVELABS_API_KEY = os.getenv('TWELVELABS_API_KEY')

# Initialize connections
openai_client = OpenAI()
connections.connect(uri=URL, token=TOKEN)
collection = Collection(COLLECTION_NAME)
collection.load()

twelvelabs_client = TwelveLabs(api_key=TWELVELABS_API_KEY)

2 - Generation of Embedding with Twelve Labs Embed API

This section covers embedding generation using the Twelve Labs SDK with Marengo-retrieval-2.7 for multimodal retrieval. The solution integrates text and video processing to create efficient search embeddings.

# Generate text and segmented video embeddings for a product
def generate_embedding(product_info):
    try:
        st.write("Starting embedding generation process...")
        st.write(f"Processing product: {product_info['title']}")
        
        # Initialize the Twelve Labs client
 twelvelabs_client = TwelveLabs(api_key=TWELVELABS_API_KEY)
        st.write("TwelveLabs client initialized successfully")
        
        st.write("Attempting to generate text embedding...")

        # Formatting the Text Data of Product Catalogue
        text = f"product type: {product_info['title']}. " \
               f"product description: {product_info['desc']}. " \
               f"product category: fashion apparel."
               
        st.write(f"Generating embedding for text: {text}")
        
        # Generating the Text Embeddings
        text_embedding = twelvelabs_client.embed.create(
            model_name="Marengo-retrieval-2.7",
            text=text
        ).text_embedding.segments[0].embeddings_float
        st.write("Text embedding generated successfully")
        
        # Create and wait for video embedding task
        st.write("Creating video embedding task...")
        video_task = twelvelabs_client.embed.task.create(
            model_name="Marengo-retrieval-2.7",
            video_url=product_info['video_url'],
            video_clip_length=6
        )
        
        def on_task_update(task):
            st.write(f"Video processing status: {task.status}")
        
        st.write("Waiting for video processing to complete...")
        video_task.wait_for_done(sleep_interval=2, callback=on_task_update)
        
        # Retrieve segmented video embeddings
        video_task = video_task.retrieve()
        if not video_task.video_embedding or not video_task.video_embedding.segments:
            raise Exception("Failed to retrieve video embeddings")
        
        video_segments = video_task.video_embedding.segments
        st.write(f"Retrieved {len(video_segments)} video segments")
        
        video_embeddings = []
        for segment in video_segments:
            video_embeddings.append({
                'embedding': segment.embeddings_float,
                'metadata': {
                    'scope': 'clip',
                    'start_time': segment.start_offset_sec,
                    'end_time': segment.end_offset_sec,
                    'video_url': product_info['video_url']
                }
            })
        
        return {
            'text_embedding': text_embedding,
            'video_embeddings': video_embeddings
        }, None
        
    except Exception as e:
        st.error("Error in embedding generation")
        st.error(f"Error message: {str(e)}")
        return None, str(e)

The process begins by structuring and formatting text to ensure consistent, meaningful embeddings. Video embedding follows, with status monitoring and progress tracking. Each video segment is set to 6 seconds (video_clip_length=6).

Metadata tracking is essential for maintaining traceability and enabling precise video segment and text searching. The results are then stored in the Milvus vector database collection.

3 - Embedding Insertion into the Milvus Vector Database

Vector databases efficiently store and retrieve high-dimensional embedding data. This section demonstrates how to insert both text and video embeddings into the Milvus vector database with proper metadata management and error handling.

# Insert text and all video segment embeddings into Milvus Collection
def insert_embeddings(embeddings_data, product_info):
    try:
        metadata = {
            "product_id": product_info['product_id'],
            "title": product_info['title'],
            "description": product_info['desc'],
            "video_url": product_info['video_url'],
            "link": product_info['link']
        }
        
        # Insert text embedding
        text_entry = {
            "id": int(uuid.uuid4().int & (1<<63)-1),
            "vector": embeddings_data['text_embedding'],
            "metadata": metadata,
            "embedding_type": "text"
        }
        collection.insert([text_entry])
        st.write("Text embedding inserted successfully")
        
        # Insert each video segment embedding
        for video_segment in embeddings_data['video_embeddings']:
            video_entry = {
                "id": int(uuid.uuid4().int & (1<<63)-1),
                "vector": video_segment['embedding'],
                "metadata": {**metadata, **video_segment['metadata']},
                "embedding_type": "video"
            }
            collection.insert([video_entry])
        
        st.write(f"Inserted {len(embeddings_data['video_embeddings'])} video segment embeddings")
        return True
        
    except Exception as e:
        st.error(f"Error inserting embeddings: {str(e)}")
        return False

After defining the id, vector, and metadata, the embedding_type is specified to accommodate different formats in the same collection space. The data is then inserted using the collection.insert method.

4 - Semantic Search Implementation for Image to Video Retrieval

Semantic search with image queries enables finding similar video content. This implementation combines Twelve Labs embedding capabilities with Milvus vector search. Here's how the utility function searched_similar_videos works:

__wf_reserved_inherit

First, the system loads and reads the product image to find video segments containing similar products. Twelve Labs embed then converts the image into an embedding for semantic search against stored video embeddings in the Milvus collection.

# Load and Read Query Image
image_path = "path/to/your/image.jpg" 
with open(image_path, 'rb') as f: 
    image_file = f.read()

# Intializing of client and Generation of Image Emebedding
twelvelabs_client = TwelveLabs(api_key=TWELVELABS_API_KEY)
image_embedding = twelvelabs_client.embed.create(
    model_name="Marengo-retrieval-2.7",
    image_file=image_file
).image_embedding.segments[0].embeddings_float

The search uses cosine similarity and cluster parameters to balance speed and accuracy. When collection.search runs with embedding_type == 'video' and other parameters, it retrieves the specified number of product video segments.

search_params = {
    "metric_type": "COSINE",
    "params": {
        "nprobe": 1024,
        "ef": 64
    }
}

# Search relevant video segments
results = collection.search(
    data=[image_embedding],
    anns_field="vector",
    param=search_params,
    limit=2,
    expr="embedding_type == 'video'",
    output_fields=["metadata"]
)

The system processes the results, converts similarity scores to percentages for clarity, and organizes the metadata into a structured dictionary.

search_results = []
for hits in results:
    for hit in hits:
        metadata = hit.metadata
        # Convert score from [-1,1] to [0,100] range
        similarity = round((hit.score + 1) * 50, 2)
        similarity = max(0, min(100, similarity))
        
        search_results.append({
            'Title': metadata.get('title', ''),
            'Description': metadata.get('description', ''),
            'Link': metadata.get('link', ''),
            'Start Time': f"{metadata.get('start_time', 0):.1f}s",
            'End Time': f"{metadata.get('end_time', 0):.1f}s",
            'Video URL': metadata.get('video_url', ''),
            'Similarity': f"{similarity}%",
            'Raw Score': hit.score
        })

# Sort by similarity score in descending order
search_results.sort(key=lambda x: float(x['Similarity'].rstrip('%')), reverse=True)

The sorted results can then be displayed. The UI implementation is available in vision_search.
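As one possible way to surface these results (the actual layout lives in vision_search), here is a small Streamlit sketch built on the search_results list above; the column choices and use of st.video are illustrative:

import streamlit as st

for result in search_results:
    # Show the product details alongside the matching video segment
    st.subheader(f"{result['Title']} ({result['Similarity']} match)")
    st.write(result['Description'])
    st.write(f"Appears from {result['Start Time']} to {result['End Time']}")
    if result['Video URL']:
        st.video(result['Video URL'])
    st.markdown(f"[View product]({result['Link']})")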

5 - Developing the Retrieval Augmented Generation System

This section focuses on question-answering in the fashion assistant context, combining multimodal retrieval with LLM (gpt-3.5) for contextual responses. It emphasizes multimodal retrieval from text and video data with metadata for effective personalization. The get_rag_response function implements the RAG functionality.


The process begins by converting the user's question and context into a text embedding using the Marengo-retrieval-2.7 engine with Twelve Labs embed, enabling semantic search across the database.

# Sample question
question = "Suggest some black dresses for a party"
question_with_context = f"fashion product: {question}"

# Generate embedding for the question
question_embedding = twelvelabs_client.embed.create(
    model_name="Marengo-retrieval-2.7",
    text=question_with_context
).text_embedding.segments[0].embeddings_float

After setting semantic search parameters, the system performs separate collection searches for text and video segments. Video segment retrieval finds the most relevant segments matching the text query, providing users with comprehensive exploration options.

The expr filter restricts each search to a single embedding type, while different limits apply to the text and video searches. A typical search retrieves the top two matching products with metadata and the top three relevant video segments.

# Configure search parameters
search_params = {
    "metric_type": "COSINE",
    "params": {
        "nprobe": 1024,
        "ef": 64
    }
}

# Search for relevant text embeddings
text_results = collection.search(
    data=[question_embedding],
    anns_field="vector",
    param=search_params,
    limit=2,  # Get top 2 text matches
    expr="embedding_type == 'text'",
    output_fields=["metadata"]
)

# Search for relevant video segments
video_results = collection.search(
    data=[question_embedding],
    anns_field="vector",
    param=search_params,
    limit=3,  # Get top 3 video segments
    expr="embedding_type == 'video'",
    output_fields=["metadata"]
)

During result processing, the system extracts metadata from each segment, converts similarity scores to percentages, and organizes product details in a structured dictionary.

# Process text results
text_docs = []
for hits in text_results:
    for hit in hits:
        metadata = hit.metadata
        similarity = round((hit.score + 1) * 50, 2)
        similarity = max(0, min(100, similarity))
        
        text_docs.append({
            "title": metadata.get('title', 'Untitled'),
            "description": metadata.get('description', 'No description available'),
            "product_id": metadata.get('product_id', ''),
            "video_url": metadata.get('video_url', ''),
            "link": metadata.get('link', ''),
            "similarity": similarity,
            "raw_score": hit.score,
            "type": "text"
        })

# Process video results
video_docs = []
for hits in video_results:
    for hit in hits:
        metadata = hit.metadata
        similarity = round((hit.score + 1) * 50, 2)
        similarity = max(0, min(100, similarity))
        
        video_docs.append({
            "title": metadata.get('title', 'Untitled'),
            "description": metadata.get('description', 'No description available'),
            "product_id": metadata.get('product_id', ''),
            "video_url": metadata.get('video_url', ''),
            "link": metadata.get('link', ''),
            "similarity": similarity,
            "raw_score": hit.score,
            "start_time": metadata.get('start_time', 0),
            "end_time": metadata.get('end_time', 0),
            "type": "video"
        })

After gathering relevant information, the system focuses on retrieval-augmented generation. The retrieved context is structured for efficient LLM processing.

The system defines both System and User Prompts. The System Prompt establishes the fashion assistant context, while the User Prompt combines the user's query with retrieved database context. The system includes error handling for cases where context retrieval fails.

# Handle case when no results found
if not text_docs and not video_docs:
    response_data = {
        "response": "I couldn't find any matching products. Try describing what you're looking for differently.",
        "metadata": None
    }
else:
    # Create context from text results for LLM
    text_context = "\n\n".join([
        f"Product: {doc['title']}\nDescription: {doc['description']}\nLink: {doc['link']}"
        for doc in text_docs
    ])

    # Prepare messages for OpenAI
    messages = [
        {
            "role": "system",
            "content": """You are a professional fashion advisor and AI shopping assistant.
            Organize your response in the following format:

            First, provide a brief, direct answer to the user's query
            Then, describe any relevant products found that match their request, including:
               - Product name and key features
               - Why this product matches their needs
               - Style suggestions for how to wear or use the item
            Finally, provide any additional style advice or recommendations
            
            Keep your response engaging and natural while maintaining this clear structure.
            Focus on being helpful and specific rather than promotional."""
        },
        {
            "role": "user",
            "content": f"""Query: {question}

Available Products:
{text_context}

Please provide fashion advice and product recommendations based on these options."""
        }
    ]

    # Get response from OpenAI
    chat_response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0.5,
        max_tokens=500
    )

    # Format final response
    response_data = {
        "response": chat_response.choices[0].message.content,
        "metadata": {
            "sources": text_docs + video_docs,
            "total_sources": len(text_docs) + len(video_docs),
            "text_sources": len(text_docs),
            "video_sources": len(video_docs)
        }
    }

The text_context contains the retrieved information. Using gpt-3.5-turbo, the system generates responses with explanations, suggestions, and user guidance as specified in the system prompt. The response is formatted in an accessible dictionary structure. For UI implementation details, see the app.
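To illustrate how this dictionary might be consumed outside the Streamlit UI, here is a small snippet that prints the generated answer and lists the retrieved video sources with their timestamps; the keys follow response_data exactly as built above:

print(response_data["response"])

metadata = response_data.get("metadata")
if metadata:
    print(f"\nBacked by {metadata['total_sources']} sources "
          f"({metadata['text_sources']} text, {metadata['video_sources']} video)")
    for source in metadata["sources"]:
        if source["type"] == "video":
            # Video sources carry start/end times for the matching segment
            print(f"- {source['title']}: {source['start_time']:.1f}s to "
                  f"{source['end_time']:.1f}s ({source['similarity']}% match)")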

Demo Application

This demo has three main parts, each with a specific purpose: simplifying database insertion, enabling efficient multimodal retrieval, and improving search functionality for users.

  • Generation and insertion of text and video embeddings into the database

  • Retrieval of multimodal data through LLM chat interactions

  • Product image-based search for finding relevant video segments

The product data catalog includes fields such as product ID, title, description, link, and video URL. Text data is converted into embeddings, while other data is stored as metadata. For video segments, the system processes the video URL. All embeddings are then stored in a single collection.
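For reference, a single catalog entry might look like the following; the values are illustrative placeholders, while the field names match the product_info dictionary used during insertion:

product_info = {
    "product_id": "P001",  # illustrative values only
    "title": "Men's Black Crew-Neck T-Shirt",
    "desc": "Slim-fit cotton t-shirt in black, suited to casual wear.",
    "link": "https://example.com/products/p001",
    "video_url": "https://example.com/videos/p001.mp4"
}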


For multimodal retrieval, here's an example query: "I'm looking for a men's black T-shirt". The chat responds with a generated recommendation alongside the retrieved product details and relevant video segments.


For semantic video search, a black t-shirt image can be used as the query, with the result limit set to the top 2 matching video segments.


More Ideas to Experiment with the Tutorial

Understanding how an application works and how it was built helps you create innovative products that serve users' needs. Here are some potential use cases to experiment with:

🛍️ E-commerce: Enhance product search and recommendations using text and image queries

🎵 Music Discovery: Find similar songs, artists, or genres based on audio clips and user preferences

🎥 Intelligent Video Search Engine: Find videos based on their visual and audio content, helping content creators, journalists, and researchers

🗺️ Personalized Travel Planner: Create travel itineraries based on user preferences, reviews, and destination data

📚 Educational Resource Management: Organize and find learning materials, including documents, presentations, and videos, based on content needs

Conclusion

This tutorial demonstrates how multimodal retrieval works through embeddings, powered by Twelve Labs and Milvus vector database. This implementation shows developers how to build scalable applications that handle both visual and textual data. Thanks for following along—we welcome your ideas for improving user experience and solving various challenges.

Additional Resources

Learn more about the embedding generation engine—Marengo-retrieval-2.7. To explore Twelve Labs further and deepen your understanding of video content analysis, check out these resources:

  • Discord Community: Join our vibrant community of developers and enthusiasts to discuss ideas, ask questions, and share your projects. Join the Twelve Labs Discord

  • Sample Applications: Explore a variety of sample applications to inspire your next project or learn new implementation techniques

  • Explore Tutorials: Dive deeper into Twelve Labs capabilities with our comprehensive tutorials

We encourage you to leverage these resources to expand your knowledge and create innovative applications using Twelve Labs video understanding technology.