AI/ML
Introducing Pegasus 1.2: An Industry-Grade Video Language Model for Scalable Applications
![](https://framerusercontent.com/images/ydubdma4qMl3VBg2pWZvWvg6kxc.jpg)
Aiden Lee, Leo Kim, Kian Kim, Dan Kim, Sam Choi, Sunny Nguyen, James Le
We are excited to announce Pegasus 1.2, our most advanced video language model yet, designed to deliver industry-grade video understanding capabilities for real-world applications.
![](https://framerusercontent.com/images/JSykNAu6JhkejY1G6MAjEF28X2Q.png)
Feb 10, 2025
7 Min
1 - Introduction
Unlike videos in many academic benchmarks, real-world video applications face two distinct challenges:
Real-world videos range from seconds to multiple hours in length
Accurate temporal understanding is required
To address these commercial demands, today we're announcing our latest work, Pegasus 1.2, a significant leap forward in industry-grade video language models. Pegasus 1.2 achieves state-of-the-art performance in long video understanding: it supports videos up to 1 hour long with best-in-class accuracy while maintaining low latency and competitive pricing. Furthermore, our embedding storage intelligently caches videos, making repeated queries to the same video even faster and cheaper.
Thus, Pegasus 1.2 serves as a cutting-edge tool that delivers business value through its focused, intelligent system design—excelling exactly where production-grade video processing pipelines need it most.
2 - Best-in-class video language model for long videos
![](https://framerusercontent.com/images/yjWP7BIWuvnkqb69flNBK4ZaS60.png)
While handling long videos is crucial in commercial settings, a key concern is processing time and, with it, elongated time-to-value. As input videos get longer, a standard video processing / inference system quickly becomes bottlenecked by orders of magnitude more frames, making it nearly impractical for commercial use and wider adoption. Moreover, a system facing commercial expectations must reason over much longer temporal horizons while still providing accurate responses to input queries and prompts.
![](https://framerusercontent.com/images/H8ixKXfPVGVHTvaeiEckoP4NDI.png)
Latency: To evaluate Pegasus 1.2's speed, we measure time-to-first-token (TTFT) against other frontier model APIs—GPT-4o and Gemini 1.5 Pro—for videos ranging from 3 to 60 minutes long. Leveraging our video-centric model architecture and optimized inference system, Pegasus 1.2 demonstrates consistent time-to-first-token latency for videos up to 15 minutes in duration, while exhibiting superior response times for extended content ranging from 30 to 60 minutes in length.
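Time-to-first-token can be measured client-side against any streaming API. The sketch below is illustrative only: `fake_stream` is a stand-in for a real streaming generate call, and the timing logic is the part that carries over to a real benchmark.

```python
import time

def time_to_first_token(stream):
    """Measure TTFT: seconds from request start until the first token arrives."""
    start = time.perf_counter()
    first = next(iter(stream))  # blocks until the first streamed token
    return first, time.perf_counter() - start

def fake_stream(delay_s=0.05, tokens=("The", "first", "touchdown")):
    """Stand-in for a streaming generate call; a real API yields tokens over HTTP."""
    for tok in tokens:
        time.sleep(delay_s)
        yield tok

token, ttft = time_to_first_token(fake_stream())
print(f"first token={token!r}, TTFT={ttft:.3f}s")
```

In a real comparison, the generator would wrap each provider's streaming endpoint so all models are timed with the same clock and the same first-token definition.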
![](https://framerusercontent.com/images/o8GK9x9KtrDqbPcF1AV8rmYcRc4.png)
Performance: We evaluate Pegasus 1.2’s performance against the same set of frontier model APIs using a subset of the Video-MME dataset containing videos longer than 30 minutes (VideoMME-Long). Pegasus 1.2 shows the new state-of-the-art performance, outperforming all other flagship APIs.
![](https://framerusercontent.com/images/zYtlRFRZ9Moty9qfJOuniemcQE.png)
Pricing: Pegasus 1.2 delivers best-in-class performance for commercial video processing without the premium price tag. Rather than trying to do everything, we've focused on excelling at long videos and accurate temporal understanding. This focused approach has allowed us to create a highly optimized system that delivers superior performance at a competitive price point.
Better yet, our system is designed to support repeated video-to-text generations without breaking the bank. Once videos are indexed, Pegasus 1.2 generates rich video embeddings and stores them in our database for subsequent API calls, allowing users to keep generating perpetually at very low cost. For example, Google Gemini 1.5 Pro's caching price is $4.50 per 1M tokens per hour of storage (roughly the number of tokens in 1 hour of video), whereas our embedding storage is only $0.09 per video hour per month, a staggering 36,000× reduction in price. Our design is much better suited to customers who have large archives of video and need to understand them all economically.
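The 36,000× figure follows directly from the two prices, assuming (per the post) that 1 hour of video is roughly 1M tokens and taking a 30-day month:

```python
# Back-of-the-envelope check of the caching-cost comparison above.
# Assumptions: 1 hour of video ~ 1M tokens; a 30-day (720-hour) month.
HOURS_PER_MONTH = 30 * 24  # 720

gemini_per_hour = 4.50                                # $ per 1M tokens (~1 video-hour) per hour stored
gemini_per_month = gemini_per_hour * HOURS_PER_MONTH  # $3,240 per video-hour per month

pegasus_per_month = 0.09                              # $ per video-hour per month of embedding storage

ratio = gemini_per_month / pegasus_per_month
print(ratio)  # 36000.0
```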
![](https://framerusercontent.com/images/P3smXSS3DKdX6b6XlQ4S9otzcU.png)
3 - Real-world applications
Many real-world long-video applications require exceptional temporal understanding. As a video gets longer, information is spread sequentially across the content, forming a long context. Pegasus 1.2 shows exceptional temporal understanding, localizing and grounding its answers with timestamps.
For example, in a typical scenario a user wants to find the timestamps of specific events or actions in a football match video.
Question 1: At what time in the video, did the scoring first appear?
Pegasus 1.2 Response
The first instance of scoring in the video appears at 00:56 when the San Francisco 49ers successfully make a 33-yard field goal.
GPT-4o Response
The first scoring appeared at 7:44 in the video, when the San Francisco 49ers kicked a field goal, making the score 3-0.
Gemini 1.5 Pro Response
The score first appeared at [00:00:02], and it's a snowy football game between the San Francisco 49ers and the Buffalo Bills.
Question 2: How long does it take for there to be the first touchdown?
Pegasus 1.2 Response
The first touchdown occurs approximately 1 minute and 54 seconds into the video.
GPT-4o Response
The first touchdown occurs at approximately 12:31 in the fourth quarter, which means it takes until that point in the game for the first touchdown to be scored.
Gemini 1.5 Pro Response
The first touchdown occurs at [00:00:59–00:02:06].
For both questions, Pegasus 1.2 finds the correct timestamps for the target event or action, showcasing its exceptional temporal understanding capability.
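To feed grounded answers like "00:56" or "[00:00:02]" into a downstream workflow (seeking a player, clipping a segment), the model's timestamps need converting to seconds. A minimal helper, purely illustrative:

```python
def timestamp_to_seconds(ts: str) -> int:
    """Convert 'MM:SS' or 'HH:MM:SS' (optionally bracketed, as some models
    emit) into a second offset into the video."""
    parts = [int(p) for p in ts.strip("[]").split(":")]
    if not 2 <= len(parts) <= 3:
        raise ValueError(f"unrecognized timestamp: {ts!r}")
    seconds = 0
    for p in parts:  # fold each unit: hours -> minutes -> seconds
        seconds = seconds * 60 + p
    return seconds

print(timestamp_to_seconds("00:56"))       # 56
print(timestamp_to_seconds("[00:00:02]"))  # 2
```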
Below are other industry use cases that Pegasus 1.2 can support:
Use case 1: Marketing / Brand / Product analysis
Understand granular visual elements (brand / product names, the professional artist shown on screen), perform detailed sentiment analysis, and follow the user's instructions.
Prompt:
You are analyzing a video tutorial created by a social media influencer showcasing beauty products. The video contains multiple segments where the influencer uses and discusses various products. Based on the video, extract the list of all beauty products featured in the video in the order they appear. For each product, specify:
Brand Name: The name of the brand or manufacturer.
Product Name: The specific product name, as mentioned in the video or displayed on the packaging.
Sentiment Analysis: Assess the influencer’s sentiment towards the product (e.g., positive, neutral, or negative).
Provide key phrases or expressions used that indicate the sentiment.
Note visual or tonal cues (e.g., smiling, excitement in voice, frustration).
Overall Sentiment: Summarize the general sentiment of the entire video tutorial, indicating whether the tone was predominantly positive, neutral, or negative, and explain your reasoning based on the influencer’s behavior and dialogue.
Output:
Based on the video tutorial, the beauty product featured is:
Brand Name: Fenty Beauty by Rihanna
Product Name: Pro Filter Amplifying Eyeshadow Primer
Sentiment Analysis: Positive
Key Phrases/Expressions: "What I love about this eye primer is its weightless application," "It's crease-proof, weightless, fade-proof, and lasts us all day," "It really helps to grip any eyeshadow color, allowing it to last longer."
Visual/Tonal Cues: The makeup artist, Hector Espinal, is smiling and appears enthusiastic while applying the primer. He speaks with excitement and satisfaction.
Overall Sentiment: The overall sentiment of the video tutorial is predominantly positive. This is evident from the makeup artist's enthusiastic expressions and the positive phrases used to describe the primer. The artist's excitement and satisfaction with the product are clear indicators of a positive sentiment.
Use case 2: Metadata Generation
Comprehend objects, people, actions, events, and their relationships in video; assign appropriate classes and metadata; and understand non-English content, allowing enterprises to effectively manage video content across their archives.
Prompt:
Build an enterprise ontology and knowledge graph based on the content of this video. Output a json file with classes, entities and relationships that can be used in a knowledge graph visualization tool
Output:
{ "classes": [ "Show", "Host", "Judge", "Contestant", "DanceStyle", "Performance", "AudienceReaction", "Score" ], "entities": [ { "class": "Show", "id": "DanseAvecLesStars", "attributes": { "name": "Danse avec les Stars", "season": "10", "stageSetup": ["large staircase", "bar area", "central dance floor"], "visualEffects": ["dynamic lighting", "special effects", "smoke machines", "confetti cannons"] } }, { "class": "Host", "id": "CamilleCombal", "attributes": { "name": "Camille Combal", "role": "Main Host" } }, { "class": "Host", "id": "KarineFerri", "attributes": { "name": "Karine Ferri", "role": "Co-host" } }, { "class": "Judge", "id": "PatrickDupond", "attributes": { "name": "Patrick Dupond", "role": "Judge" } }, { "class": "Judge", "id": "ChrisMarques", "attributes": { "name": "Chris Marques", "role": "Judge" } }, { "class": "Judge", "id": "JeanMarcGenereux", "attributes": { "name": "Jean-Marc Généreux", "role": "Judge" } }, { "class": "Judge", "id": "DenitsaIkonomova", "attributes": { "name": "Denitsa Ikonomova", "role": "Judge" } }, { "class": "Contestant", "id": "IrisMittenaere", "attributes": { "name": "Iris Mittenaere", "profession": "Miss Universe", "partner": "Anthony Colette" } }, { "class": "Contestant", "id": "AnthonyColette", "attributes": { "name": "Anthony Colette", "profession": "Professional Dancer", "partner": "Iris Mittenaere" } }, { "class": "DanceStyle", "id": "Salsa", "attributes": { "name": "Salsa" } }, { "class": "DanceStyle", "id": "ChaCha", "attributes": { "name": "Cha-Cha" } }, { "class": "DanceStyle", "id": "Contemporary", "attributes": { "name": "Contemporary" } }, { "class": "DanceStyle", "id": "Tango", "attributes": { "name": "Tango" } }, { "class": "Performance", "id": "Performance1", "attributes": { "danceStyles": ["Salsa", "Cha-Cha", "Contemporary"], "performers": ["Iris Mittenaere", "Anthony Colette"], "audienceReaction": "enthusiastic applause", "judgesFeedback": ["positive", "constructive criticism"] } }, { "class": 
"AudienceReaction", "id": "Reaction1", "attributes": { "type": "cheering", "intensity": "high" } }, { "class": "Score", "id": "Score1", "attributes": { "PatrickDupond": 6, "ChrisMarques": 6, "JeanMarcGenereux": 5, "DenitsaIkonomova": 6, "total": 25 } } ], "relationships": [ { "source": "DanseAvecLesStars", "target": "CamilleCombal", "type": "hasHost" }, { "source": "DanseAvecLesStars", "target": "KarineFerri", "type": "hasHost" }, { "source": "DanseAvecLesStars", "target": "PatrickDupond", "type": "hasJudge" }, { "source": "DanseAvecLesStars", "target": "ChrisMarques", "type": "hasJudge" }, { "source": "DanseAvecLesStars", "target": "JeanMarcGenereux", "type": "hasJudge" }, { "source": "DanseAvecLesStars", "target": "DenitsaIkonomova", "type": "hasJudge" }, { "source": "IrisMittenaere", "target": "AnthonyColette", "type": "hasPartner" }, { "source": "Performance1", "target": "IrisMittenaere", "type": "performedBy" }, { "source": "Performance1", "target": "AnthonyColette", "type": "performedBy" }, { "source": "Performance1", "target": "Salsa", "type": "includesDanceStyle" }, { "source": "Performance1", "target": "ChaCha", "type": "includesDanceStyle" }, { "source": "Performance1", "target": "Contemporary", "type": "includesDanceStyle" }, { "source": "Performance1", "target": "AudienceReaction1", "type": "receivedReaction" }, { "source": "Performance1", "target": "Score1", "type": "receivedScore" } ] }
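Before loading generated JSON like this into a visualization tool, it is worth validating that every relationship endpoint refers to a defined entity id. The check below uses a toy excerpt (the `hasPerformance` edge is added for illustration); notably, the output above links `Performance1` to `AudienceReaction1` while the entity is defined with id `Reaction1`, exactly the kind of slip such a check catches.

```python
# Sanity-check a generated knowledge graph: every relationship endpoint
# must refer to a defined entity id. Toy excerpt of the output above,
# reproducing its 'AudienceReaction1' vs 'Reaction1' id mismatch.
graph = {
    "entities": [
        {"class": "Show", "id": "DanseAvecLesStars"},
        {"class": "Performance", "id": "Performance1"},
        {"class": "AudienceReaction", "id": "Reaction1"},
    ],
    "relationships": [
        {"source": "DanseAvecLesStars", "target": "Performance1", "type": "hasPerformance"},
        {"source": "Performance1", "target": "AudienceReaction1", "type": "receivedReaction"},
    ],
}

def dangling_endpoints(graph):
    """Return relationships whose source or target is not a defined entity."""
    ids = {e["id"] for e in graph["entities"]}
    return [(r["source"], r["target"], r["type"])
            for r in graph["relationships"]
            if r["source"] not in ids or r["target"] not in ids]

print(dangling_endpoints(graph))  # flags the 'AudienceReaction1' edge
```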
Use case 3: Creative optimization and content localization
Reasoning with timestamps allows users to adapt their videos to specific criteria, markets, or demographics.
Prompt:
Suggest edits for the ads to make it appeal to kids. Break down by timestamp and include voice over as appropriate.
Output:
To make the Amazon Go ads more appealing to kids, here are some suggested edits, broken down by timestamp and including voice-over suggestions where appropriate:
[0s (00:00) - 11s (00:11)]
Visual Edit: Add playful animations of food items (e.g., a dancing jar of peanut butter) to capture kids' attention.
Voice Over: "Welcome to the magical world of Amazon Go! Where shopping is as fun as a playground!"
[12s (00:12) - 21s (00:21)]
Visual Edit: Include colorful graphics and fun facts about the food items, such as "Did you know? This cheese is perfect for making yummy grilled cheese sandwiches!"
Voice Over: "Explore the aisles with your favorite characters! Discover new snacks and treats!"
[22s (00:22) - 29s (00:29)]
Visual Edit: Add a cartoon character (e.g., a friendly robot) guiding the customers through the store.
Voice Over: "Follow our robot friend as he helps you find the best snacks!"
[30s (00:30) - 36s (00:36)]
Visual Edit: Show a child using the app with a fun, animated interface.
Voice Over: "Kids, you can help mom and dad with the shopping! Just tap and go!"
[37s (00:37) - 45s (00:45)]
Visual Edit: Include a montage of kids enjoying the food items, with playful sound effects.
Voice Over: "Taste the joy of Amazon Go! Every bite is a happy moment!"
[46s (00:46) - 57s (00:57)]
Visual Edit: Add a fun, animated list of items with playful fonts and colors.
Voice Over: "Check off your favorites and make a shopping list that's as fun as it is useful!"
[58s (00:58) - 71s (01:11)]
Visual Edit: Show a child-friendly version of the store with bright colors and interactive elements.
Voice Over: "Welcome to the kid's corner of Amazon Go! Where every visit is an adventure!"
[72s (01:12) - 80s (01:20)]
Visual Edit: Include a montage of kids and families enjoying their shopping experience.
Voice Over: "Family fun at Amazon Go! Where every shopping trip is a family adventure!"
[81s (01:21) - 89s (01:29)]
Visual Edit: Show a child-friendly version of the app interface with fun animations.
Voice Over: "Kids, you can help with the shopping! Just tap and go!"
[90s (01:30) - 96s (01:36)]
Visual Edit: Add playful animations and graphics around the store's exterior.
Voice Over: "Welcome to the magical world of Amazon Go! Where shopping is as fun as a playground!"
These edits incorporate playful elements, colorful graphics, and fun voice-overs to make the ads more engaging and appealing to kids.
Use case 4: Sports analysis
Timestamp understanding in sports allows fans, coaches, and players to analyze data, extract insights from sporting events, and power downstream workflows such as fan engagement, player coaching, and performance reporting.
Prompt:
Following the video content, do the following tasks:
Create the summary as a Golden State Warriors fan, highlighting what the team did well.
At what time in the video, the game ended?
What is the final score?
Output:
Summary as a Golden State Warriors Fan:
The Golden State Warriors demonstrated exceptional performance throughout the game, particularly in their offensive plays and team coordination. Stephen Curry, wearing jersey number 30, was a standout player with his consistent three-point shooting, making crucial shots that extended the team's lead. Jordan Poole, wearing jersey number 3, also contributed significantly with his dribbling and scoring abilities, adding to the team's offensive firepower. The Warriors' effective passing and teamwork were evident in several plays, where they created scoring opportunities through precise ball movement. Additionally, their defensive efforts were noteworthy, with key blocks and steals that prevented the Boston Celtics from narrowing the gap. Overall, the Warriors' ability to execute plays effectively, maintain a lead, and respond to the Celtics' scoring attempts made for a strong and cohesive team performance.
Game End Time:
The game ended at 578 seconds (09:38) in the video.
Final Score:
The final score displayed in the video was 103-90 in favor of the Golden State Warriors.
4 - Model Overview & Limitations
![](https://framerusercontent.com/images/UUWpylVh4FOgVZtTMGdsn3NacWM.png)
Architecture:
Pegasus 1.2 employs an encoder-decoder architecture optimized for comprehensive video understanding, featuring three primary components: a Video Encoder, a Video Tokenizer, and a Large Language Model. This architecture enables sophisticated processing of both visual and textual information while maintaining computational efficiency.
The integration of these components creates a cohesive system capable of processing both fine-grained details and long-term contextual information. Our architecture demonstrates that compact models can achieve sophisticated video understanding through thoughtful design choices and innovative solutions to fundamental challenges in multimodal processing.
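The data flow between the three named components can be sketched as below. This is a heavily simplified illustration, not the actual implementation: the model internals are not described in this post, so the function bodies, shapes, and the `tokens_per_clip` compression knob are all placeholders.

```python
# Illustrative wiring of Video Encoder -> Video Tokenizer -> LLM.
# All internals are stand-ins (assumptions), kept only to show the data flow.
from dataclasses import dataclass

@dataclass
class Frame:
    pixels: bytes  # placeholder for decoded image data

def video_encoder(frames):
    """Map raw frames to per-frame visual features (stub)."""
    return [hash(f.pixels) % 997 for f in frames]

def video_tokenizer(features, tokens_per_clip=4):
    """Compress many frame features into a short token sequence the LLM
    can attend over, which is what keeps long videos tractable."""
    step = max(1, len(features) // tokens_per_clip)
    return features[::step][:tokens_per_clip]

def llm_decode(video_tokens, prompt):
    """Stub for the language model conditioned on video tokens + text prompt."""
    return f"answer({prompt!r}, ctx={len(video_tokens)} video tokens)"

frames = [Frame(pixels=bytes([i])) for i in range(32)]
out = llm_decode(video_tokenizer(video_encoder(frames)), "When is the first touchdown?")
print(out)
```

The point of the sketch is the middle stage: aggressive tokenization between the encoder and the LLM is what lets a compact decoder reason over hour-long inputs.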
Limitations:
Safety & Biases: Pegasus 1.2 is designed with safety mechanisms; however, as with any AI model, there is a risk of generating content that could be considered harmful or inappropriate without proper oversight and regulation. Our understanding of ethical and safety measures for video foundation models is ongoing. As we continue testing and gather feedback, a detailed evaluation and ethics report will be made available.
Hallucinations: Pegasus 1.2 can occasionally produce inaccurate outputs. While we have improved it from our Pegasus 1.1 to reduce hallucination, users should be mindful of this limitation, especially for tasks where high precision is required and factual correctness is critical.
Acknowledgements
This release is a joint effort across multiple functional groups, including science, engineering, product, business development, and operations. It is co-authored by the Pegasus team within the Twelve Labs Research Science division.
Resources
1 - Introduction
Unlike many academic settings, real-world video applications face two unique challenges:
Real-world videos range from seconds to multi-hours in length
Accurate temporal understanding is required
To address these commercial demands, today we’e announcing our latest work, Pegasus 1.2, a significant leap forward in industry-grade video language models. Pegasus 1.2 achieves state-of-the-art performance in long video understanding. Our model can support videos that are up to 1 hour long with best-in-class accuracy while also maintaining low latency and competitive pricing. Furthermore, our embeddings storage intelligently caches videos, allowing for repeated queries to the same video to be even faster and cheaper.
Thus, Pegasus 1.2 serves as a cutting-edge tool that delivers business value through its focused, intelligent system design—excelling exactly where production-grade video processing pipelines need it most.
2 - Best in-class video language model to handle long videos
![](https://framerusercontent.com/images/yjWP7BIWuvnkqb69flNBK4ZaS60.png)
While handling long videos is crucial for commercial settings, one of the key concerns is processing time and thus elongated time-to-value. As input videos become longer, a standard video processing / inference system quickly gets bottlenecked to handle orders of magnitude more frames, making it nearly impractical for commercial use and wider-adoption. Furthermore, such system with commercial expectations would be tasked with reasoning over much longer temporal horizons all at the same time providing accurate responses to input queries / prompts.
![](https://framerusercontent.com/images/H8ixKXfPVGVHTvaeiEckoP4NDI.png)
Latency: To evaluate Pegasus 1.2's speed, we measure time-to-first-token (TTFT) against other frontier model APIs—GPT-4o and Gemini 1.5 Pro—for videos ranging from 3 to 60 minutes long. Leveraging our video-centric model architecture and optimized inference system, Pegasus 1.2 demonstrates consistent time-to-first-token latency for videos up to 15 minutes in duration, while exhibiting superior response times for extended content ranging from 30 to 60 minutes in length.
![](https://framerusercontent.com/images/o8GK9x9KtrDqbPcF1AV8rmYcRc4.png)
Performance: We evaluate Pegasus 1.2’s performance against the same set of frontier model APIs using a subset of the Video-MME dataset containing videos longer than 30 minutes (VideoMME-Long). Pegasus 1.2 shows the new state-of-the-art performance, outperforming all other flagship APIs.
![](https://framerusercontent.com/images/zYtlRFRZ9Moty9qfJOuniemcQE.png)
Pricing Pegasus 1.2 delivers best-in-class performance for commercial video processing without the premium price tag. Rather than trying to do everything, we've focused on excelling at long videos and accurate temporal understanding. This focused approach has allowed us to create a highly optimized system that delivers superior performance at a competitive price point.
Better yet, our system is designed to support repeated video-to-text generations without breaking the bank. Once videos are indexed, Pegasus 1.2 generates rich video embeddings and stores in our database for subsequent API calls, which allows users to continue generating perpetually at a very low cost. For example, Google Gemini 1.5 Pro’s caching price is $4.5 / 1M tokens per hour of storage (which is roughly the number of tokens for 1 hour of video). Whereas, our embedding storage is only $0.09 per video hour per month, a staggering x36,000 reduction in prices. Our design is much more suitable for customers who have large archive of video and need to understand it all economically.
![](https://framerusercontent.com/images/P3smXSS3DKdX6b6XlQ4S9otzcU.png)
3 - Real-world applications
Many real-world long video applications require exceptional temporal understanding. As video becomes longer, the information gets spread across the content sequentially constituting a context. Pegasus 1.2 shows exceptional temporal understanding, able to localize and ground the answers based on the timestamp.
For example, a typical scenario is where a user would like to find timestamps of specific events or actions from a football match video.
Question 1: At what time in the video, did the scoring first appear?
Pegasus 1.2 Response
The first instance of scoring in the video appears at 00:56 when the San Francisco 49ers successfully make a 33-yard field goal.
GPT-4o Response
The first scoring appeared at 7:44 in the video, when the San Francisco 49ers kicked a field goal, making the score 3-0.
Gemini 1.5 Pro Response
The score first appeared at [00:00:02], and it's a snowy football game between the San Francisco 49ers and the Buffalo Bills.
Question 2: How long does it take for there to be the first touchdown?
Pegasus 1.2 Response
The first touchdown occurs approximately 1 minute and 54 seconds into the video.
GPT-4o Response
The first touchdown occurs at approximately 12:31 in the fourth quarter, which means it takes until that point in the game for the first touchdown to be scored.
Gemini 1.5 Pro Response
The first touchdown occurs at [00:00:59–00:02:06].
For both questions, Pegasus 1.2 finds the correct timestamps for the target event or action, showcasing its exceptional temporal understanding capability.
Below are other industry-use cases that can be supported by Pegasus 1.2:
Use case 1: Marketing / Brand / Product analysis
Understanding granular visual elements (brand/ product name, professional artist shown on screen), perform a detailed sentiment analysis and follow user’s instruction.
Prompt:
You are analyzing a video tutorial created by a social media influencer showcasing beauty products. The video contains multiple segments where the influencer uses and discusses various products. Based on the video, extract the list of all beauty products featured in the video in the order they appear. For each product, specify:
Brand Name: The name of the brand or manufacturer.
Product Name: The specific product name, as mentioned in the video or displayed on the packaging.
Sentiment Analysis: Assess the influencer’s sentiment towards the product (e.g., positive, neutral, or negative).
Provide key phrases or expressions used that indicate the sentiment.
Note visual or tonal cues (e.g., smiling, excitement in voice, frustration).
Overall Sentiment: Summarize the general sentiment of the entire video tutorial, indicating whether the tone was predominantly positive, neutral, or negative, and explain your reasoning based on the influencer’s behavior and dialogue.
Output:
Based on the video tutorial, the beauty product featured is:
Brand Name: Fenty Beauty by Rihanna
Product Name: Pro Filter Amplifying Eyeshadow Primer
Sentiment Analysis: Positive
Key Phrases/Expressions: "What I love about this eye primer is its weightless application," "It's crease-proof, weightless, fade-proof, and lasts us all day," "It really helps to grip any eyeshadow color, allowing it to last longer."
Visual/Tonal Cues: The makeup artist, Hector Espinal, is smiling and appears enthusiastic while applying the primer. He speaks with excitement and satisfaction.
Overall Sentiment: The overall sentiment of the video tutorial is predominantly positive. This is evident from the makeup artist's enthusiastic expressions and the positive phrases used to describe the primer. The artist's excitement and satisfaction with the product are clear indicators of a positive sentiment.
Use case 2: Metadata Generation
Comprehending objects, people, actions, events and their relationships in video, assign appropriate classes and metadata, and understand non-English content, allowing enterprises to effectively manage video content across their archive.
Prompt:
Build an enterprise ontology and knowledge graph based on the content of this video. Output a json file with classes, entities and relationships that can be used in a knowledge graph visualization tool
Output:
{ "classes": [ "Show", "Host", "Judge", "Contestant", "DanceStyle", "Performance", "AudienceReaction", "Score" ], "entities": [ { "class": "Show", "id": "DanseAvecLesStars", "attributes": { "name": "Danse avec les Stars", "season": "10", "stageSetup": ["large staircase", "bar area", "central dance floor"], "visualEffects": ["dynamic lighting", "special effects", "smoke machines", "confetti cannons"] } }, { "class": "Host", "id": "CamilleCombal", "attributes": { "name": "Camille Combal", "role": "Main Host" } }, { "class": "Host", "id": "KarineFerri", "attributes": { "name": "Karine Ferri", "role": "Co-host" } }, { "class": "Judge", "id": "PatrickDupond", "attributes": { "name": "Patrick Dupond", "role": "Judge" } }, { "class": "Judge", "id": "ChrisMarques", "attributes": { "name": "Chris Marques", "role": "Judge" } }, { "class": "Judge", "id": "JeanMarcGenereux", "attributes": { "name": "Jean-Marc Généreux", "role": "Judge" } }, { "class": "Judge", "id": "DenitsaIkonomova", "attributes": { "name": "Denitsa Ikonomova", "role": "Judge" } }, { "class": "Contestant", "id": "IrisMittenaere", "attributes": { "name": "Iris Mittenaere", "profession": "Miss Universe", "partner": "Anthony Colette" } }, { "class": "Contestant", "id": "AnthonyColette", "attributes": { "name": "Anthony Colette", "profession": "Professional Dancer", "partner": "Iris Mittenaere" } }, { "class": "DanceStyle", "id": "Salsa", "attributes": { "name": "Salsa" } }, { "class": "DanceStyle", "id": "ChaCha", "attributes": { "name": "Cha-Cha" } }, { "class": "DanceStyle", "id": "Contemporary", "attributes": { "name": "Contemporary" } }, { "class": "DanceStyle", "id": "Tango", "attributes": { "name": "Tango" } }, { "class": "Performance", "id": "Performance1", "attributes": { "danceStyles": ["Salsa", "Cha-Cha", "Contemporary"], "performers": ["Iris Mittenaere", "Anthony Colette"], "audienceReaction": "enthusiastic applause", "judgesFeedback": ["positive", "constructive criticism"] } }, { "class": 
"AudienceReaction", "id": "Reaction1", "attributes": { "type": "cheering", "intensity": "high" } }, { "class": "Score", "id": "Score1", "attributes": { "PatrickDupond": 6, "ChrisMarques": 6, "JeanMarcGenereux": 5, "DenitsaIkonomova": 6, "total": 25 } } ], "relationships": [ { "source": "DanseAvecLesStars", "target": "CamilleCombal", "type": "hasHost" }, { "source": "DanseAvecLesStars", "target": "KarineFerri", "type": "hasHost" }, { "source": "DanseAvecLesStars", "target": "PatrickDupond", "type": "hasJudge" }, { "source": "DanseAvecLesStars", "target": "ChrisMarques", "type": "hasJudge" }, { "source": "DanseAvecLesStars", "target": "JeanMarcGenereux", "type": "hasJudge" }, { "source": "DanseAvecLesStars", "target": "DenitsaIkonomova", "type": "hasJudge" }, { "source": "IrisMittenaere", "target": "AnthonyColette", "type": "hasPartner" }, { "source": "Performance1", "target": "IrisMittenaere", "type": "performedBy" }, { "source": "Performance1", "target": "AnthonyColette", "type": "performedBy" }, { "source": "Performance1", "target": "Salsa", "type": "includesDanceStyle" }, { "source": "Performance1", "target": "ChaCha", "type": "includesDanceStyle" }, { "source": "Performance1", "target": "Contemporary", "type": "includesDanceStyle" }, { "source": "Performance1", "target": "AudienceReaction1", "type": "receivedReaction" }, { "source": "Performance1", "target": "Score1", "type": "receivedScore" } ] }
Use case 3: Creative optimization and content localization
The reasoning capability with timestamps allows users to adapt their video to certain criteria, markets, or demographics.
Prompt:
Suggest edits for the ads to make it appeal to kids. Break down by timestamp and include voice over as appropriate.
Output:
To make the Amazon Go ads more appealing to kids, here are some suggested edits, broken down by timestamp and including voice-over suggestions where appropriate:
[0s (00:00) - 11s (00:11)]
Visual Edit: Add playful animations of food items (e.g., a dancing jar of peanut butter) to capture kids' attention.
Voice Over: "Welcome to the magical world of Amazon Go! Where shopping is as fun as a playground!"
[12s (00:12) - 21s (00:21)]
Visual Edit: Include colorful graphics and fun facts about the food items, such as "Did you know? This cheese is perfect for making yummy grilled cheese sandwiches!"
Voice Over: "Explore the aisles with your favorite characters! Discover new snacks and treats!"
[22s (00:22) - 29s (00:29)]
Visual Edit: Add a cartoon character (e.g., a friendly robot) guiding the customers through the store.
Voice Over: "Follow our robot friend as he helps you find the best snacks!"
[30s (00:30) - 36s (00:36)]
Visual Edit: Show a child using the app with a fun, animated interface.
Voice Over: "Kids, you can help mom and dad with the shopping! Just tap and go!"
[37s (00:37) - 45s (00:45)]
Visual Edit: Include a montage of kids enjoying the food items, with playful sound effects.
Voice Over: "Taste the joy of Amazon Go! Every bite is a happy moment!"
[46s (00:46) - 57s (00:57)]
Visual Edit: Add a fun, animated list of items with playful fonts and colors.
Voice Over: "Check off your favorites and make a shopping list that's as fun as it is useful!"
[58s (00:58) - 71s (01:11)]
Visual Edit: Show a child-friendly version of the store with bright colors and interactive elements.
Voice Over: "Welcome to the kid's corner of Amazon Go! Where every visit is an adventure!"
[72s (01:12) - 80s (01:20)]
Visual Edit: Include a montage of kids and families enjoying their shopping experience.
Voice Over: "Family fun at Amazon Go! Where every shopping trip is a family adventure!"
[81s (01:21) - 89s (01:29)]
Visual Edit: Show a child-friendly version of the app interface with fun animations.
Voice Over: "Kids, you can help with the shopping! Just tap and go!"
[90s (01:30) - 96s (01:36)]
Visual Edit: Add playful animations and graphics around the store's exterior.
Voice Over: "Welcome to the magical world of Amazon Go! Where shopping is as fun as a playground!"
These edits incorporate playful elements, colorful graphics, and fun voice-overs to make the ads more engaging and appealing to kids.
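A timestamped breakdown like the one above can be post-processed into structured segments for an editing workflow. A minimal sketch, assuming the `[Ns (MM:SS) - Ns (MM:SS)]` header format shown in the sample output (the helper itself is hypothetical):

```python
import re

# Matches segment headers such as "[12s (00:12) - 21s (00:21)]"; adjust the
# pattern if your model formats timestamps differently.
SEGMENT = re.compile(r"\[(\d+)s \(\d{2}:\d{2}\) - (\d+)s \(\d{2}:\d{2}\)\]")

def parse_segments(text: str) -> list[tuple[int, int]]:
    """Extract (start_seconds, end_seconds) pairs from a suggested-edit breakdown."""
    return [(int(start), int(end)) for start, end in SEGMENT.findall(text)]

sample = (
    "[12s (00:12) - 21s (00:21)]\nVisual Edit: ...\n"
    "[22s (00:22) - 29s (00:29)]\nVisual Edit: ..."
)
print(parse_segments(sample))  # [(12, 21), (22, 29)]
```

The resulting second offsets can then drive a cutting tool or an edit-decision list directly, without a human re-reading the prose.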
Use case 4: Sports analysis
Timestamp understanding in sports allows fans, coaches, and players to analyze data, extract insights from sporting events, and power downstream workflows such as fan engagement, player coaching, and performance reporting.
Prompt:
Following the video content, do the following tasks:
Create the summary as a Golden State Warriors fan, highlighting what the team did well.
At what time in the video, the game ended?
What is the final score?
Output:
Summary as a Golden State Warriors Fan:
The Golden State Warriors demonstrated exceptional performance throughout the game, particularly in their offensive plays and team coordination. Stephen Curry, wearing jersey number 30, was a standout player with his consistent three-point shooting, making crucial shots that extended the team's lead. Jordan Poole, wearing jersey number 3, also contributed significantly with his dribbling and scoring abilities, adding to the team's offensive firepower. The Warriors' effective passing and teamwork were evident in several plays, where they created scoring opportunities through precise ball movement. Additionally, their defensive efforts were noteworthy, with key blocks and steals that prevented the Boston Celtics from narrowing the gap. Overall, the Warriors' ability to execute plays effectively, maintain a lead, and respond to the Celtics' scoring attempts made for a strong and cohesive team performance.
Game End Time:
The game ended at 578 seconds (09:38) in the video.
Final Score:
The final score displayed in the video was 103-90 in favor of the Golden State Warriors.
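Answers like "578 seconds (09:38)" mix second offsets with clock notation; when assembling downstream reports it helps to convert between the two consistently. A trivial helper (the function name is an illustration, not part of any API):

```python
def to_timestamp(seconds: int) -> str:
    """Convert a second offset, as returned in answers above, to MM:SS notation."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"

print(to_timestamp(578))  # 09:38
```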
4 - Model Overview & Limitations
![](https://framerusercontent.com/images/UUWpylVh4FOgVZtTMGdsn3NacWM.png)
Architecture:
Pegasus 1.2 employs an encoder-decoder architecture optimized for comprehensive video understanding, featuring three primary components: a Video Encoder, a Video Tokenizer, and a Large Language Model. This architecture enables sophisticated processing of both visual and textual information while maintaining computational efficiency.
The integration of these components creates a cohesive system capable of processing both fine-grained details and long-term contextual information. Our architecture demonstrates that compact models can achieve sophisticated video understanding through thoughtful design choices and innovative solutions to fundamental challenges in multimodal processing.
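The three-stage flow described above can be sketched as a toy pipeline. Everything below is illustrative: the class names, feature shapes, and stride are hypothetical stand-ins, not Twelve Labs' actual implementation.

```python
from dataclasses import dataclass

@dataclass
class VideoEncoder:
    def encode(self, frames: list) -> list[list[float]]:
        # One feature vector per frame (stubbed as zeros for illustration).
        return [[0.0] * 4 for _ in frames]

@dataclass
class VideoTokenizer:
    stride: int = 2  # compress frame features so long videos stay tractable

    def tokenize(self, features: list[list[float]]) -> list[list[float]]:
        return features[:: self.stride]

@dataclass
class LanguageModel:
    def generate(self, tokens: list, prompt: str) -> str:
        return f"answer to {prompt!r} conditioned on {len(tokens)} video tokens"

def run_pipeline(frames: list, prompt: str) -> str:
    features = VideoEncoder().encode(frames)
    tokens = VideoTokenizer().tokenize(features)
    return LanguageModel().generate(tokens, prompt)

print(run_pipeline(frames=list(range(8)), prompt="Summarize the clip"))
# answer to 'Summarize the clip' conditioned on 4 video tokens
```

The sketch only illustrates the division of labor: the encoder extracts per-frame features, the tokenizer compresses them into a shorter token sequence, and the language model conditions its text generation on those tokens plus the user prompt.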
Limitations:
Safety & Biases: Pegasus 1.2 is designed with safety mechanisms; however, as with any AI model, there is a risk of generating content that could be considered harmful or inappropriate without proper oversight and regulation. Our understanding of ethical and safety measures for video foundation models is ongoing. As we continue testing and gather feedback, a detailed evaluation and ethics report will be made available.
Hallucinations: Pegasus 1.2 can occasionally produce inaccurate outputs. While we have improved upon Pegasus 1.1 to reduce hallucinations, users should be mindful of this limitation, especially for tasks where high precision is required and factual correctness is critical.
Acknowledgements
This release is a joint effort across multiple functional groups, including science, engineering, product, business development, and operations. It is co-authored by the Pegasus team under the Twelve Labs Research Science division.
Resources
Related articles
![](https://framerusercontent.com/images/uwwmi5Lfv7nTu3o00fTvNnccKk.jpeg)
The State of Video-Language Models: Research Insights from the Inaugural NeurIPS Workshop
![](https://framerusercontent.com/images/xhBUbiN8VVcydriWljqo73VEx4Q.png)
Introducing Marengo 2.7: Pioneering Multi-Vector Embeddings for Advanced Video Understanding
![](https://framerusercontent.com/images/Iat5EuoV9Kai9P1sBowlmorU.png)
TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models
![](https://framerusercontent.com/images/VxpcXB82bZ1IjZkat49k7QaKZjA.png)
Pegasus-1 Open Beta: Setting New Standards in Video-Language Modeling