‍
Twelve Labs, a San Francisco Bay Area-based AI research and product company, is at the forefront of multimodal video understanding. Today, we are thrilled to unveil the state-of-the-art video-to-text generation capabilities of Pegasus-1, our latest video-language foundation model. This release reflects our commitment to offering a comprehensive suite of APIs tailored for various downstream video understanding tasks. Our suite spans natural language-based video moment retrieval, classification, and now, with the latest release, prompt-based video-to-text generation.
‍
Video data is intriguing as it contains multiple modalities within a single format. We believe that video understanding requires a novel take on marrying the intricacies of visual perception and the sequential and contextual nuances of audio and text. With the rise of capable image and language models, the dominant approach for video understanding has been to reframe it as an image or speech understanding problem. A typical framework would involve sampling frames from the video and inputting them into a vision-language model.
While this approach may be viable for short videos (which is why most vision-language models focus on clips under one minute), most real-world videos exceed one minute and can easily extend to hours. Applying a vanilla "image-first" approach to such videos means processing tens of thousands of images per video, which in turn means manipulating a vast number of image-text embeddings that, at best, loosely capture the spatiotemporal semantics. This is impractical for many applications in terms of performance, latency, and cost. Furthermore, the dominant methodology overlooks the multimodal nature of videos, in which the joint analysis of visual and auditory elements, including speech, is crucial for a comprehensive understanding of their content.
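To make the scale concrete, here is a quick back-of-the-envelope calculation, assuming a one-hour video and a few representative sampling rates (the rates are illustrative assumptions, not measurements from any particular system):

```python
# Rough frame counts for an "image-first" pipeline on a one-hour video.
# The sampling rates below are illustrative assumptions, not system measurements.
VIDEO_SECONDS = 60 * 60  # one hour

for fps in (24, 4, 1):  # native frame rate vs. two common sampling rates
    frames = VIDEO_SECONDS * fps
    print(f"sampling at {fps:>2} fps -> {frames:,} frames (and image-text embeddings) to process")
# 24 fps -> 86,400 frames; 4 fps -> 14,400 frames; even 1 fps -> 3,600 frames
```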
With the fundamental properties of video data in mind, Twelve Labs has adopted a “Video First” strategy, focusing our models, data, and ML systems solely on processing and understanding video data. This stands in contrast to the prevalent “Language/Image First” approach observed among many Generative AI players. Four central principles underscore our “Video First” ethos, guiding both the design of our video-language foundation models and the architecture of our ML system:
‍
With a single API call, developers can prompt the Pegasus-1 model to produce specific text outputs from their video data. In contrast to existing solutions that either rely on speech-to-text conversion or rely solely on visual frame data, Pegasus-1 integrates visual, audio, and speech information to generate more holistic text from videos, achieving new state-of-the-art performance on video summarization benchmarks. (See the Evaluation and Results section below.)
The Gist and Summary APIs come pre-loaded with relevant prompts and work out of the box without user prompts. The Gist API produces concise text outputs such as titles, topics, and lists of relevant hashtags. The Summary API generates video summaries, chapters, and highlights. For customized outputs, the experimental Generate API lets users prompt for specific formats and styles, from bullet points to reports and even creative lyrics based on the content of the video.
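To give a feel for the developer experience, here is a minimal sketch of what such calls might look like. The base URL, endpoint paths, and field names below are illustrative assumptions, not the definitive interface; please refer to the API documentation for the exact request format.

```python
# Minimal sketch of calling the video-to-text endpoints.
# NOTE: base URL, paths, and field names are illustrative assumptions;
# consult the official API documentation for the exact request format.
import requests

API_KEY = "YOUR_API_KEY"            # placeholder
VIDEO_ID = "YOUR_INDEXED_VIDEO_ID"  # ID of a video you have already indexed
BASE_URL = "https://api.twelvelabs.io/v1.2"  # assumed base URL

headers = {"x-api-key": API_KEY, "Content-Type": "application/json"}

# Gist: pre-loaded prompts for titles, topics, and hashtags.
gist = requests.post(f"{BASE_URL}/gist",
                     json={"video_id": VIDEO_ID, "types": ["title", "topic", "hashtag"]},
                     headers=headers).json()

# Summary: video summaries, chapters, and highlights, with an optional styling prompt.
summary = requests.post(f"{BASE_URL}/summarize",
                        json={"video_id": VIDEO_ID, "type": "summary",
                              "prompt": "Write the summary as three bullet points."},
                        headers=headers).json()

# Generate (experimental): free-form, prompt-driven text output.
report = requests.post(f"{BASE_URL}/generate",
                       json={"video_id": VIDEO_ID,
                             "prompt": "Write a short incident report based on this video."},
                       headers=headers).json()

print(gist, summary, report, sep="\n")
```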
‍
Example 1: Generating a small report from a video through the Gist and Summary APIs.
‍
Example 2: Generating a video summary by passing in a styling prompt to the Summary API.
‍
Example 3: Generating a customized text output by prompting through the experimental Generate API.
‍
Example 4: Demonstrating multimodal understanding that incorporates visual, speech, and auditory cues inside a video. (Highlighted in Green: Visual Information)
‍
The Pegasus-1 model is structured around three primary components, responsible for generating video-native embeddings, video-language-aligned embeddings, and text outputs, respectively (a conceptual sketch follows the list):
1. Video Encoder Model - derived from our existing Marengo embedding model
2. Video-Language Alignment Model
3. Large Language Model - decoder model
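Below is a conceptual sketch of how such components might compose. All module names, dimensions, and the fusion mechanism are placeholders for illustration; this is not the actual Pegasus-1 implementation.

```python
# Conceptual sketch of the three-stage pipeline described above.
# Module names, sizes, and interfaces are placeholders for illustration only.
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Stand-in for the Marengo-derived encoder: raw video -> video-native embeddings."""
    def __init__(self, dim=1024):
        super().__init__()
        self.proj = nn.LazyLinear(dim)  # placeholder for a real spatiotemporal encoder

    def forward(self, video_frames, audio_features):
        # Toy pooling: average over time and space, then project. A real encoder
        # models spatiotemporal structure jointly with audio rather than pooling.
        visual = video_frames.mean(dim=(1, 3, 4))        # (B, T, C, H, W) -> (B, C)
        x = torch.cat([visual, audio_features], dim=-1)  # (B, C + audio_dim)
        return self.proj(x)                              # (B, dim)

class VideoLanguageAligner(nn.Module):
    """Maps video-native embeddings into the language model's embedding space."""
    def __init__(self, video_dim=1024, llm_dim=4096, num_tokens=32):
        super().__init__()
        self.to_llm = nn.Linear(video_dim, llm_dim * num_tokens)
        self.num_tokens, self.llm_dim = num_tokens, llm_dim

    def forward(self, video_emb):
        # One video embedding becomes a short sequence of "soft" video tokens.
        return self.to_llm(video_emb).view(-1, self.num_tokens, self.llm_dim)

class PegasusLikePipeline(nn.Module):
    """Video encoder -> alignment model -> LLM decoder, mirroring the list above."""
    def __init__(self, llm):
        super().__init__()
        self.encoder = VideoEncoder()
        self.aligner = VideoLanguageAligner()
        self.llm = llm  # any decoder-only LM that accepts soft prompt embeddings

    def forward(self, video_frames, audio_features, prompt_embeddings):
        video_tokens = self.aligner(self.encoder(video_frames, audio_features))
        # Aligned video tokens are prepended to the text-prompt embeddings, and the
        # decoder generates the final text output conditioned on both.
        inputs = torch.cat([video_tokens, prompt_embeddings], dim=1)
        return self.llm(inputs_embeds=inputs)
```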
‍
The Pegasus-1 model has approximately 80B parameters in aggregate. Detailed parameter distributions for individual components, including the size of the Marengo embedding model, are not disclosed at this time.
‍
As one would expect, the overall model performance is strongly correlated with that of each component. The degree to which each constituent model influences the overall quality remains an open question. We will conduct an extensive ablation study to better understand this and will share our findings in the future.
‍
Twelve Labs acknowledges the importance of ensuring the responsible deployment of advanced technologies, including our Pegasus-1 model. We are dedicated to developing a comprehensive and transparent dataset and evaluation framework that benchmarks all of our models across fine-grained categories, including correctness, detail orientation, contextual understanding, safety, and helpfulness. Metrics specifically targeting safety and helpfulness in video-language models are in development, with results to be shared soon. We are excited to present preliminary findings in this blog post, with a more detailed report to follow. The evaluation below is based on the preview version of Pegasus-1.
Our evaluation codebase can be found here.
‍
We compare our model against three distinct model (or product) categories:
‍
MSR-VTT Dataset (Xu et al., 2016): MSR-VTT is a widely used video-caption dataset that assesses a model's ability to generate a description or caption for short video clips ranging from 10 to 40 seconds. Each video is annotated with 20 sentences by human annotators. To capture as much detail as possible, we use an LLM (ChatGPT) to merge the 20 individual captions into a single, dense description. Our evaluation is conducted on the JSFusion Test Split, which consists of 1,000 video clips.
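As a rough illustration of the caption-merging step just described, the sketch below collapses the 20 per-video captions into one dense reference with an LLM. The prompt wording and model name are placeholders of our own; the exact merging setup used for the benchmark may differ.

```python
# Illustrative sketch of merging MSR-VTT's 20 per-video captions into a single
# dense reference description with an LLM. Prompt wording and model name are
# placeholders; the exact merging setup used for the benchmark may differ.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def merge_captions(captions: list[str]) -> str:
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(captions))
    prompt = (
        "The following sentences all describe the same short video clip:\n"
        f"{numbered}\n\n"
        "Merge them into one dense paragraph that preserves every distinct detail, "
        "without repeating information or adding anything that is not stated."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```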
Video-ChatGPT Video Descriptions Dataset (Maaz et al., 2023): Video-to-text evaluations predominantly use video captioning datasets such as MSR-VTT (above). While these short video descriptions offer a point of reference, they fall short in evaluating text generation for the long-form videos prevalent in real-world contexts. Given this, we conduct an additional evaluation of our model on the Video-ChatGPT Video Description Dataset. This dataset includes 500 videos from ActivityNet, all accompanied by comprehensive human-annotated summaries. Unlike traditional captioning datasets, the videos range from 30 seconds to several minutes, and each comes with a dense, 5-8 sentence summary encapsulating both visual and auditory elements.
‍
Following the Quantitative Evaluation Framework for Video-based Conversation Models (QEFVC) (Maaz et al., 2023), we evaluate the models on three criteria: Correctness of Information, Detail Orientation, and Contextual Understanding. To do this, we ask an instruction-tuned language model (e.g., GPT-4) to score each criterion with respect to the reference summary. To quantify overall performance, we average the three scores and define the result as the QEFVC quality score.
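For concreteness, here is a minimal sketch of that scoring loop. The judging prompt is paraphrased for brevity; the actual prompts follow Maaz et al. (2023).

```python
# Sketch of QEFVC-style scoring: an LLM judge rates each criterion against the
# reference summary, and the three scores are averaged into a single quality score.
# The judge prompt below is a paraphrase, not the exact prompt from Maaz et al. (2023).
CRITERIA = ("correctness of information", "detail orientation", "contextual understanding")

def qefvc_quality(prediction: str, reference: str, judge) -> float:
    """`judge(prompt) -> float` wraps a call to an instruction-tuned LLM such as GPT-4."""
    scores = []
    for criterion in CRITERIA:
        prompt = (
            f"Rate the {criterion} of the candidate video summary, using the reference "
            "summary as ground truth.\n"
            f"Reference: {reference}\nCandidate: {prediction}\n"
            "Respond with a single numeric score."
        )
        scores.append(judge(prompt))
    return sum(scores) / len(scores)  # the QEFVC quality score
```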
While this evaluation metric offers a convenient means of comparison with existing models, it is not without challenges. Past work on language model evaluation has highlighted that relying solely on GPT-4 to score model predictions is prone to inaccuracy. It has also been observed that making the evaluation as fine-grained as possible improves both the consistency and the precision of the evaluation (Ye et al., 2023). With this in mind, and drawing inspiration from FActScore (Min et al., 2023), we introduce a refined evaluation method, VidFactScore (Video Factual Score), to evaluate the quality of video summaries in a much more fine-grained manner:
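As an illustration of the fact-level idea (the exact VidFactScore procedure will be detailed in the forthcoming report), the sketch below computes a FActScore-style precision, recall, and F1 over atomic facts, assuming LLM-backed helpers for fact extraction and support checking.

```python
# Illustrative sketch of a fact-level F1 in the spirit of FActScore. The atomic-fact
# extraction and support checks are assumed to be LLM-backed helpers supplied by the
# caller; this is not the exact VidFactScore procedure.
def vidfactscore_f1(prediction: str, reference: str, extract_facts, is_supported) -> float:
    pred_facts = extract_facts(prediction)  # atomic facts asserted by the model
    ref_facts = extract_facts(reference)    # atomic facts in the human-written summary

    # Precision: fraction of predicted facts that are supported by the reference.
    precision = sum(is_supported(f, reference) for f in pred_facts) / max(len(pred_facts), 1)
    # Recall: fraction of reference facts that are covered by the model's summary.
    recall = sum(is_supported(f, prediction) for f in ref_facts) / max(len(ref_facts), 1)

    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```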
Compared with the current state-of-the-art model (Video-ChatGPT), Pegasus-1 exhibits a 61% relative improvement on the MSR-VTT dataset and a 47% relative improvement on the Video Description dataset, as measured by the QEFVC quality score. Against the ASR+LLM model cohort (including models such as Whisper+ChatGPT and Vendor A), the gap widens, with Pegasus-1 outperforming them by 79% on MSR-VTT and 188% on the Video Description dataset.
On our newly proposed VidFactScore-F1 metric, Pegasus-1 shows an absolute increase of 20% on the MSR-VTT dataset and 14% on the Video Description dataset compared to Video-ChatGPT. When benchmarked against the ASR+LLM model cohort, the gains are 25% on MSR-VTT and 33% on the Video Description dataset. These findings consistently suggest that VidFactScore aligns well with, and correlates strongly with, evaluations based on the QEFVC framework.
We also made an intriguing observation regarding videos that rely predominantly on speech, such as stand-up routines or lectures: our model outperforms the ASR+LLM models even in these scenarios. Contrary to the initial perception that ASR might suffice for such videos, our findings suggest otherwise. We postulate that even minimal visual cues (for instance, "a man performing stand-up comedy" or "a reaction video") enrich the speech data, yielding more precise and comprehensive summaries. Such outcomes underscore that video comprehension goes beyond speech understanding: incorporating both visual and speech modalities is necessary for a comprehensive understanding. See the "Reaction Video" in the "In-the-wild Examples" section below.
‍
These examples were randomly selected from diverse domains to illustrate the capabilities of Pegasus-1 compared to existing approaches.
Note that the generated output may contain inaccuracies.
We welcome any feedback and will do our best to address it in the near future.
‍
Video Duration: Our API supports videos ranging from 15 seconds to 1 hour in length (30 minutes for the Preview release), with plans to expand this to span from 4 seconds to 2 hours. For now, we recommend videos that fall within the 5 to 15-minute range.
Types of Video Content: While Pegasus-1 was trained on diverse videos, we recommend using it on videos with a consistent theme or topic. For example, a compilation of unrelated segments might disrupt the model’s ability to generate a coherent summary.
Hallucinations: Pegasus-1 shows promising initial results and offers a new paradigm for tackling video understanding tasks. However, it can occasionally produce inaccurate outputs, and its reliability remains an area for improvement. Efforts are underway to address hallucinations, and we expect significant progress in the near future.
Safety & Biases: Despite our diligent efforts in data selection and cleaning, the possibility of NSFW content and unintentional biases persists in the model. Our understanding of ethical and safety measures for video foundation models is still evolving. As we continue testing and gathering feedback, a detailed evaluation and ethics report will be made available.
Chat Capability: Pegasus-1 is engineered mainly for extracting factual information from videos, not for general-purpose video chats. Our experimental Generate API accepts user prompts and offers limited chat functionality, though it isn't specifically tailored for this purpose.
Semantic Reliability: Pegasus-1 may generate substantially different outputs even when given prompts that are semantically similar. We acknowledge this issue and will work to mitigate it in future releases.
‍
Twelve Labs is proud to introduce Pegasus-1, reflecting our commitment to advancing the field of multimodal video understanding. Our video-language foundation model offers a pioneering approach to generative tasks on video content and is a meaningful first step toward our mission of making videos as easy to work with as text.
We are a group of friendly, curious, and passionate people from all walks of life with a vision of driving the technological singularity for the betterment of humanity.
More coming soon.
‍
This is a joint team effort across multiple functional groups including product, engineering, science, and business development. (First-name alphabetical order)
Product & Design: Andrei Popescu, Anthony Giuliani, EK Yoon, James Le, Jenna Kang, Soyoung Lee, Sue Kim (intern), Sunny Nguyen, Travis Couture
Engineering: Abraham Jo, Ed Park, Esther Kim, Kevin Han, Noah Seo, Ryan Won, SJ Kim, Tony Moon
Model & Data: Aiden Lee, Daniel Kim, Jae Lee, Jay Suh, Jay Yi, Jeff Kim, Kael Lee (intern), Laine Kong (intern), Lucas Lee, Mars Ha, Minjoon Seo, Ray Jung, Shane Hong (intern), William Go
Operations & Support: Dave Chung, Hans Yoon, Jenny Ahn
‍
If you'd like to cite this post in your work, please use the following BibTeX citation: