Twelve Labs is excited to announce Marengo 2.7, a new state-of-the-art multimodal embedding model that delivers more than a 15% improvement over its predecessor, Marengo 2.6.
Unlike text, where a single word embedding can effectively capture semantic meaning, video content is inherently more complex and multifaceted. A video clip simultaneously contains visual elements (objects, scenes, actions), temporal dynamics (motion, transitions), audio components (speech, ambient sounds, music), and often textual information (overlays, subtitles). Traditional single-vector approaches struggle to effectively compress all these diverse aspects into one representation without losing critical information. This complexity necessitates a more sophisticated approach to video understanding.
To address this complexity, Marengo 2.7 uses a unique multi-vector approach. Instead of compressing everything into a single vector, it creates separate vectors for different aspects of the video. One vector might capture what things look like (e.g., "a man in a black shirt"), another tracks movement (e.g., "waving his hand"), and another remembers what was said (e.g., "video foundation model is fun"). This approach helps the model better understand videos that contain many different types of information, leading to more accurate video analysis across all aspects - visual, motion, and audio.
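To make the idea concrete, here is a minimal, illustrative sketch of how a single clip represented by several per-aspect vectors could be matched against a text query, with the best-matching aspect determining the score. The aspect names, dimensionality, and fusion rule are assumptions for illustration only; they are not Marengo 2.7's actual internals.

```python
# Illustrative sketch: a clip represented by several specialized vectors,
# matched against a query by taking the best per-aspect similarity.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim = 8  # toy dimensionality for illustration

# One clip, several aspect vectors instead of a single compressed embedding.
clip_vectors = {
    "appearance": rng.normal(size=dim),  # e.g., "a man in a black shirt"
    "motion": rng.normal(size=dim),      # e.g., "waving his hand"
    "speech": rng.normal(size=dim),      # e.g., "video foundation model is fun"
}

query_vector = rng.normal(size=dim)      # embedding of the user's text query

# Late fusion: score the query against each aspect and keep the best match,
# so a motion-only query can still surface the clip even if appearance differs.
per_aspect = {name: cosine(query_vector, v) for name, v in clip_vectors.items()}
best_aspect, best_score = max(per_aspect.items(), key=lambda kv: kv[1])
print(best_aspect, round(best_score, 3))
```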
Existing benchmarks for video understanding models often rely on detailed, narrative-style descriptions that capture the main events in a video. However, this approach doesn't reflect real-world usage patterns, where users typically make shorter, more ambiguous queries like "find the red car" or "show me the celebration scene." Users also frequently search for peripheral details, background elements, or specific objects that may only appear briefly. Additionally, queries often combine multiple modalities - visual elements with audio cues, or text overlays with specific actions. This disconnect between benchmark evaluation and actual use cases necessitated a more comprehensive evaluation approach for Marengo 2.7.
Understanding the limitations of existing benchmarks in capturing real-world use cases, we developed an extensive evaluation framework encompassing over 60 diverse datasets. This framework rigorously tests the model's capabilities across text-to-visual, image-to-visual, and text-to-audio search.
Marengo 2.7 demonstrates state-of-the-art performance across all main benchmarks, with particularly remarkable results in image-to-visual search. While the model shows strong performance across all metrics, its performance in image object search and image logo search represents a significant leap forward in the field.
Building upon the success of Marengo 2.6, our latest Marengo 2.7 video foundation model represents a significant advancement in multimodal video understanding. We also introduce a novel multi-vector approach that enables more precise and comprehensive video content analysis.
At its core, Marengo 2.7 employs a Transformer-based architecture that processes video content through a single unified framework capable of understanding visual appearance, motion, OCR text, and speech.
A defining element of Marengo 2.7 is its unique multi-vector representation. Unlike Marengo 2.6, which compresses all information into a single embedding, Marengo 2.7 decomposes the raw inputs into multiple specialized vectors. Each vector independently captures a distinct aspect of the video content - from visual appearance and motion dynamics to OCR text and speech patterns. This granular representation enables more nuanced and accurate multimodal search capabilities. The approach shows particular strength in detecting small objects while maintaining exceptional performance in general text-based search tasks.
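As a rough sketch of how this plays out at retrieval time, the snippet below keeps one embedding matrix per aspect for a small collection of clips and fuses per-aspect cosine similarities by taking the maximum, so a query only needs to match one aspect strongly to surface a clip. The aspect set, dimensions, and max-fusion rule are illustrative assumptions, not a description of Marengo 2.7's internal index.

```python
# Conceptual sketch of multi-vector retrieval over a small collection of clips.
import numpy as np

rng = np.random.default_rng(1)
dim, num_clips = 8, 5
aspects = ["appearance", "motion", "ocr", "speech"]

# One matrix of unit-normalized embeddings per aspect: shape (num_clips, dim).
index = {a: rng.normal(size=(num_clips, dim)) for a in aspects}
index = {a: m / np.linalg.norm(m, axis=1, keepdims=True) for a, m in index.items()}

def search(query, top_k=3):
    q = query / np.linalg.norm(query)
    # Cosine similarity of the query against every clip, for every aspect.
    scores = np.stack([index[a] @ q for a in aspects])  # (num_aspects, num_clips)
    fused = scores.max(axis=0)                          # best-matching aspect per clip
    ranked = np.argsort(-fused)[:top_k]
    return [(int(i), float(fused[i])) for i in ranked]

print(search(rng.normal(size=dim)))
```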
Training for Marengo 2.7 focuses on self-supervised learning with a contrastive loss on a comprehensive multimodal dataset. Based on our business and customers' needs, we have carefully curated and augmented a training dataset spanning multiple domains and modalities.
Additionally, we enhanced our training data through re-captioning with Pegasus, our own large video-language model. This process generated high-quality descriptions that capture world knowledge as well as complex motion and spatio-temporal relationships. This builds on the insight that high-quality text descriptions are critical for training robust models (Fan et al., LaCLIP, 2023; Gu et al., RWKV-CLIP, 2024).
The comprehensive training data allows Marengo 2.7 to develop a robust understanding across domains and modalities. With its extensive video content, the model learns sophisticated temporal relationships and cross-modal interactions.
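For readers unfamiliar with the training objective mentioned above, here is a minimal sketch of a symmetric, InfoNCE-style contrastive loss over paired video and text embeddings. The batch size, embedding dimension, and temperature are placeholder assumptions; this illustrates the general recipe rather than Marengo 2.7's actual training code.

```python
# Minimal sketch of a symmetric contrastive (InfoNCE-style) objective.
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    # Normalize so dot products are cosine similarities.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0))         # matching pairs sit on the diagonal
    # Symmetric loss: video-to-text and text-to-video directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

video_emb = torch.randn(16, 512)  # placeholder batch of video embeddings
text_emb = torch.randn(16, 512)   # placeholder batch of paired caption embeddings
print(contrastive_loss(video_emb, text_emb).item())
```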
The performance of Marengo 2.7 has been extensively evaluated against leading multimodal retrieval models and specialized solutions across multiple domains on 60+ benchmark datasets. Our evaluation framework encompasses text-to-visual, image-to-visual, and text-to-audio search capabilities, providing a comprehensive assessment of the model's multimodal understanding.
We compared Marengo 2.7 against strong baseline models, including leading multimodal retrieval models and specialized expert solutions.
The evaluation framework utilizes diverse datasets spanning general visual, motion, OCR, small-object, object, logo, and audio search.
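The results below are reported as average recall and mean average precision (mAP). For reference, here is a small sketch of how these retrieval metrics are typically computed from a ranked result list and a set of relevant items; dataset-specific details such as k values and relevance labels vary.

```python
# Sketch of standard retrieval metrics: recall@k and average precision.
def recall_at_k(ranked_ids, relevant_ids, k):
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def average_precision(ranked_ids, relevant_ids):
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            hits += 1
            score += hits / rank          # precision at each relevant hit
    return score / len(relevant_ids) if relevant_ids else 0.0

ranked = ["clip_3", "clip_7", "clip_1", "clip_9"]   # hypothetical search output
relevant = {"clip_7", "clip_9"}                      # hypothetical ground truth
print(recall_at_k(ranked, relevant, k=2), average_precision(ranked, relevant))
```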
To ensure transparency and reproducibility, we will open-source our comprehensive evaluation framework for video retrieval. While our current evaluation datasets are primarily machine-generated and effectively demonstrate performance trends, they require further refinement and human validation before public release. We are actively working on polishing these datasets to ensure they meet the high standards expected for public research use.
In general visual search, Marengo 2.7 achieves a 74.9% average recall across two benchmark datasets. These results represent a 4.7% improvement over Marengo 2.6 and a 4.6% advantage over external SOTA models.
In motion search, Marengo 2.7 achieves a 78.1% average recall on Something Something v2. These results represent a 22.5% improvement over Marengo 2.6 and a 30.0% advantage over external SOTA models.
In OCR search, Marengo 2.7 achieves a 77.0% mean average precision across two benchmark datasets. This represents a 10.1% improvement over Marengo 2.6 and a 13.4% advantage over external SOTA models.
In small object search, Marengo 2.7 achieves a 52.7% average recall across three custom benchmark datasets. These results represent a 10.14% improvement over Marengo 2.6 and a 10.08% advantage over external SOTA models.
In object search, Marengo 2.7 achieves a 90.6% average recall across three benchmark datasets. This result shows improvements over both Marengo 2.6 (32.6% increase) and external SOTA models (35.0% increase).
In logo search, Marengo 2.7 achieves an average mean average precision of 56.0% across three benchmark datasets. This represents a 31.8% improvement over its predecessor and a 19.2% advantage over external SOTA models. Note that the logo expert model used as a baseline is Google Cloud Vision API - Detect Logos.
In general audio search, Marengo 2.7 achieves a 57.7% average recall across three benchmark datasets. This represents a 7.7% improvement over Marengo 2.6.
To illustrate Marengo 2.7's capabilities across different search modalities, we present several representative examples that showcase its real-world performance.
These examples demonstrate Marengo 2.7's ability to handle diverse query types while maintaining high precision across different modalities.
While Marengo 2.7 demonstrates significant improvements across multiple modalities, several challenges remain in achieving comprehensive video understanding.
While the model excels at identifying primary actions and objects, it may miss subtle background activities or parallel events that occur simultaneously in the video.
The model can sometimes struggle with finding exact visual matches, particularly when searching for specific instances of objects or people that may appear multiple times in slightly different contexts.
While Marengo 2.7 handles most queries effectively, certain query types remain challenging.
Additionally, Marengo 2.7 shows limitations in text-to-logo search scenarios, particularly when dealing with logos that occupy less than 1% of the frame or appear at challenging viewing angles.
In conversation and OCR search, the model struggles with heavily accented speech, overlapping conversations, and text in unusual fonts or orientations. These challenges are particularly evident in real-world scenarios with poor lighting conditions or complex backgrounds.
These limitations represent natural areas for future research and development as we continue to advance the capabilities of multimodal video understanding. Our ongoing work focuses on addressing these challenges while maintaining the model's current strengths in cross-modal understanding and temporal reasoning.
Marengo 2.7 represents a significant leap forward in multimodal video understanding, demonstrating substantial improvements across visual, audio, and text modalities. Through its innovative multi-vector approach and comprehensive evaluation framework, we've shown that it's possible to achieve state-of-the-art performance in complex video understanding tasks while maintaining high precision across different use cases.
To support transparency and reproducibility in the field, we will be releasing a detailed technical report along with our comprehensive evaluation framework. This framework, which includes testing across 60+ datasets, will be open-sourced and regularly maintained to enable researchers and practitioners to validate our results and contribute to advancing multimodal video understanding.
This is a joint team effort across multiple functional groups, including science, engineering, product, business development, and operations. It is co-authored by the Marengo team under the Twelve Labs Research Science division.