Check out the Pegasus-1 technical report on arXiv and Hugging Face!
At Twelve Labs, our goal is to advance video understanding through the creation of innovative multimodal AI models. In a previous post, "Introducing Video-to-Text and Pegasus-1 (80B)," we introduced the alpha version of Pegasus-1. This foundational model can generate descriptive text from video input. Today, we're thrilled to announce the launch of Pegasus-1's open beta.
Pegasus-1 is designed to understand and articulate complex video content, transforming how we interact with and analyze multimedia. With roughly 17 billion parameters, the model represents a significant advance in multimodal AI, processing and generating language from video input with exceptional accuracy and detail.
In this update, we'll explore the enhancements made to Pegasus-1 since its alpha release. These include improvements in data quality, video processing, and training methods. We will also share benchmarking results against leading commercial and open-source models, demonstrating Pegasus-1's superior performance in video summarization, question answering, and conversations. Beyond quantitative metrics, Pegasus-1 also shows qualitative improvements through its enhanced world knowledge and ability to capture detailed visual information.
As a quick recap, Pegasus-1 is a multimodal foundation model designed to bridge the gap between video content and language, enabling machines to interpret and generate text based on video input. The architecture of Pegasus-1 is composed of three main components:
- A video encoder that transforms video input into rich multimodal embeddings.
- A video-language alignment model that maps those embeddings into the language model's representation space.
- A large language model (decoder) that generates text conditioned on the aligned video representation and the user's prompt.
Compared to the alpha version, the open-beta version of Pegasus-1 boasts approximately 17B parameters, making it a compact yet powerful tool for interpreting and generating text based on video data.
As we progress from the alpha to the open-beta version, we continue to refine and enhance the model to deliver even more accurate and detailed video-language understanding. These enhancements are driven by three key factors: high-quality data, optimized video processing, and refined training techniques.
Consistent with previous findings, we observe that the quality and granularity of captions have a more significant impact on the model's performance than the sheer quantity of data. For instance, Pegasus-1 trained on 100,000 high-quality video-text pairs easily outperforms the same architecture trained on a much larger dataset (10M+) with lower-quality captions.
With this experimental evidence in mind, we designed an efficient data annotation pipeline to create high-quality captions for the aforementioned 10M+ videos. Trained on such a large corpus of high-quality video-text pairs, Pegasus-1 attains foundational video understanding capabilities that are not observed in other models.
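Purely as an illustration of the quality-over-quantity principle (the details of our actual annotation pipeline are not shown here, and the scoring method and threshold below are assumptions), a minimal sketch of filtering video-text pairs by a caption-quality score might look like this:

```python
from dataclasses import dataclass

@dataclass
class VideoTextPair:
    video_id: str
    caption: str
    quality_score: float  # hypothetical score from an automated or human review step

def filter_high_quality(pairs: list[VideoTextPair], threshold: float = 0.8) -> list[VideoTextPair]:
    """Keep only pairs whose caption quality clears the threshold.

    The threshold and scoring method are illustrative assumptions,
    not details of the actual Pegasus-1 data pipeline.
    """
    return [p for p in pairs if p.quality_score >= threshold]

# A small, detailed caption survives; a vague one is dropped.
pairs = [
    VideoTextPair("vid_001", "A chef dices onions, then sautes them in olive oil over medium heat.", 0.93),
    VideoTextPair("vid_002", "cooking video", 0.35),
]
print(len(filter_high_quality(pairs)))  # -> 1
```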
We made substantial changes to the video processing pipeline to optimize both spatial and temporal resolution. We increased the number of patches per frame by 10x (spatial) and the number of frames by 1.5x (temporal), resulting in a 15x increase in the total number of patches per video. This enhancement allows Pegasus-1 to capture and convey more information per frame.
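For intuition, the 15x figure is simply the product of the two scaling factors. A toy calculation makes this concrete; the baseline patch and frame counts below are placeholders, and only the 10x and 1.5x multipliers come from the text above:

```python
# Back-of-the-envelope check of the resolution scaling described above.
patches_per_frame_alpha = 32   # assumed baseline, for illustration only
frames_per_video_alpha = 96    # assumed baseline, for illustration only

spatial_scale = 10    # 10x more patches per frame
temporal_scale = 1.5  # 1.5x more frames per video

total_alpha = patches_per_frame_alpha * frames_per_video_alpha
total_beta = int(patches_per_frame_alpha * spatial_scale) * int(frames_per_video_alpha * temporal_scale)

print(total_beta / total_alpha)  # -> 15.0, i.e. a 15x increase in patches per video
```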
Pegasus-1 is now also able to better grasp the overall story and context of the video in a more coherent manner, as evidenced by both qualitative and quantitative analyses, particularly in question-answering datasets.
As a multimodal foundation model, Pegasus-1 is trained on massive multimodal datasets over multiple stages. However, multi-stage training often suffers from the phenomenon known as catastrophic forgetting. This occurs when a model, upon learning new information, rapidly forgets the old information it was previously trained on. This issue becomes even more pressing in multimodal models that undergo sequential training across modalities.
To address this, we employ a strategic training regimen that involves multiple stages, each meticulously designed to balance the acquisition of new knowledge with the preservation of previously learned information. The key to this approach lies in the selective updates (unfreezing) of model parameters and the careful adjustment of learning rates throughout the training process.
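As a rough sketch of the general idea (not the actual Pegasus-1 training code; the module names, sizes, and learning rates are assumptions), selectively unfreezing components and assigning each its own learning rate in PyTorch could look like this:

```python
import torch
from torch import nn

# Illustrative stand-ins for the model's three components; the real modules
# and hyperparameters are not public and are assumed here for the sketch.
video_encoder = nn.Linear(1024, 4096)
alignment_model = nn.Linear(4096, 4096)
language_decoder = nn.Linear(4096, 32000)

def configure_stage(unfreeze, freeze, lrs):
    """Freeze some components, unfreeze others, and give each trainable
    component its own learning rate via optimizer parameter groups."""
    for module in freeze:
        for p in module.parameters():
            p.requires_grad = False
    param_groups = []
    for module, lr in zip(unfreeze, lrs):
        for p in module.parameters():
            p.requires_grad = True
        param_groups.append({"params": module.parameters(), "lr": lr})
    return torch.optim.AdamW(param_groups)

# Example stage: adapt the alignment model aggressively while nudging the
# decoder gently, keeping the video encoder frozen to preserve prior knowledge.
optimizer = configure_stage(
    unfreeze=[alignment_model, language_decoder],
    freeze=[video_encoder],
    lrs=[1e-4, 1e-5],
)
```

The point of the sketch is the mechanism, not the numbers: by choosing which groups receive gradients at each stage and how fast they move, later stages can add new capabilities without overwriting what earlier stages learned.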
The open-beta version of Pegasus-1, compared to the alpha version, boasts enhanced capabilities, such as a heightened ability to capture fine-grained temporal moments and a reduced propensity for hallucination, resulting in increased robustness across diverse video domains. It also demonstrates expanded world knowledge and an improved capacity to list various moments in temporal order, rather than concentrating on a singular scene.
In our comprehensive benchmarking efforts, Pegasus-1 has been evaluated against a spectrum of both commercial and open-source models. This section will detail the performance of Pegasus-1 in comparison to these models across various video-language modeling tasks.
The baseline models against which Pegasus-1 was benchmarked include commercial systems such as Google's Gemini Pro, along with leading open-source video-language models.
We specifically excluded image-based vision-language models such as LLaVA and GPT-4V from our comparison because they lack native video processing capabilities, which is a critical requirement for the tasks we are evaluating: they process sampled frames independently and therefore cannot capture the temporal dynamics that video understanding requires.
In Video Question Answering tasks, Pegasus-1's zero-shot performance on both the ActivityNet-QA and NExT-QA datasets is particularly noteworthy. Pegasus-1 demonstrates a remarkable ability to generalize across diverse videos and accurately answer video-related questions without task-specific training.
The Video-ChatGPT Benchmark (also known as QEFVC) results highlight Pegasus-1's adeptness in handling video conversations. Pegasus-1 leads the pack with scores that reflect its proficiency in Correctness, Detail, Context, Temporal understanding, and Consistency. Notably, Pegasus-1 scored 3.79 in Correctness and 4.29 in Detail, showcasing its nuanced grasp of video conversations and the context in which they occur.
Lastly, Pegasus-1 has demonstrated superior performance at summarizing videos. We compare Pegasus-1 against its competitors on the MSR-VTT dataset, using the Video-ChatGPT Benchmark's scoring scheme ("Temporal Understanding" and "Consistency" are omitted due to the nature of summarization). Pegasus-1 outperforms the baseline models on all metrics by a significant margin.
Through these benchmarks, Pegasus-1 has established itself as a formidable contender in the video-language modeling arena, setting new standards for zero-shot performance and generalization in video understanding tasks.
The following examples were randomly selected from diverse domains to illustrate the capabilities of Pegasus-1.
Safety & Biases: Pegasus-1 is designed with safety mechanisms; however, as with any AI model, there is a risk of generating content that could be considered harmful or inappropriate without proper oversight and regulation. Our understanding of ethical and safety measures for video foundation models is ongoing. As we continue testing and gathering feedback, a detailed evaluation and ethics report will be made available.
Video Duration: Our API supports videos ranging from 4 seconds to 20 minutes in length. This constraint is due to computational and memory considerations, which are common challenges in handling large-scale video data. As a result, users may need to segment longer videos into smaller parts to fully leverage the model's capabilities. We will work on natively supporting longer durations in future releases.
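If you need to split a longer recording before uploading, one straightforward option (among many) is ffmpeg's segment muxer, which cuts a file into fixed-length chunks without re-encoding. The file names below are placeholders, and the snippet assumes ffmpeg is installed and on your PATH:

```python
import subprocess

# Split a long video into ~20-minute (1200 s) chunks so each piece fits
# within the API's supported duration. Stream copy (-c copy) avoids re-encoding.
subprocess.run(
    [
        "ffmpeg",
        "-i", "long_video.mp4",
        "-c", "copy",
        "-map", "0",
        "-f", "segment",
        "-segment_time", "1200",
        "-reset_timestamps", "1",
        "chunk_%03d.mp4",
    ],
    check=True,
)
```

Note that with stream copy, cuts land on keyframes, so chunk lengths are approximate; re-encoding gives exact boundaries at the cost of speed.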
Hallucinations: Pegasus-1 can occasionally produce inaccurate outputs. While we have reduced hallucinations compared to the alpha version, users should be mindful of this limitation, especially for tasks that demand high precision and factual correctness.
The journey of Pegasus-1 from its alpha to the beta version has been marked by significant enhancements. The meticulous improvements in training data quality, video processing capabilities, and advanced training techniques have culminated in a model that not only understands video content more deeply but also interacts in a conversational context with a level of sophistication that was previously unattainable.
The benchmark results speak for themselves, placing Pegasus-1 at the forefront of the industry, outperforming established models like Google's Gemini Pro and setting new standards in video QA and video conversational frameworks. These quantitative measures, alongside the qualitative improvements in world knowledge and detail recognition, underscore the transformative potential of Pegasus-1.
While we recognize the limitations of Pegasus-1, including safety concerns, video length constraints, and occasional hallucinations, these are areas of active research and development. Our commitment to improving Pegasus-1 remains steadfast as we aim to push the boundaries of video understanding technology.
This is a joint team effort across multiple functional groups, including model and data ("core" indicates Core Contributor), engineering, product, and business development. (First-name alphabetical order)
Model: Aiden Lee, Cooper Han, Flynn Jang (core), Jae Lee, Jay Yi (core), Jeff Kim, Jeremy Kim, Kyle Park, Lucas Lee, Mars Ha, Minjoon Seo, Ray Jung (core), William Go (core)
Data: Daniel Kim (core), Jay Suh (core)
Deployment: Abraham Jo, Ed Park, Hassan Kianinejad, SJ Kim, Tony Moon, Wade Jeong
Product: Andrei Popescu, Esther Kim, EK Yoon, Genie Heo, Henry Choi, Jenna Kang, Kevin Han, Noah Seo, Sunny Nguyen, Ryan Won, Yeonhoo Park
Business & Operations: Anthony Giuliani, Dave Chung, Hans Yoon, James Le, Jenny Ahn, June Lee, Maninder Saini, Meredith Sanders, Soyoung Lee, Sue Kim, Travis Couture
If you use this model in your work, please use the following BibTeX citation and cite the author as Twelve Labs: