Pioneering perceptual reasoning through video understanding

Imagine time freezing at a specific moment, like a snapshot. Up to this point, our brains have been processing sensory input, helping us understand what has happened, predict what might happen next, and plan accordingly. This ability, known as perceptual reasoning, forms the basis of our intelligence.
Developmental studies show that humans develop perception (the ability to gain knowledge of their immediate surroundings) in the first months of life, before any language ability emerges (Object Perception, Object-directed Action, and Physical Knowledge in Infancy, 1995). However, modern AI has bypassed a crucial learning step: creating a robust world representation from video, the modality that most closely resembles sensory data.
At Twelve Labs, we are at the forefront of this research, focusing on the critical role of perceptual reasoning. We aim to unlock the full potential of video understanding and redefine how machines perceive and interact with the world. To bridge this gap, we train cutting-edge foundation models to learn rich, multimodal representations from video data and to leverage those representations for complex, high-level reasoning tasks expressed in language.
The Art of Detail

Perception: Capturing the sensory details through a video-native encoder

Our video-native encoder model, Marengo, is the embodiment of Perception. Just as the human sensory organs excel at capturing the world's visual and auditory detail, Marengo is trained to analyze visual frames and their temporal relationships alongside speech and sound, ensuring a thorough understanding of both visual and auditory elements. This context-aware, video-native representation encoder serves as the foundation of our perceptual reasoning pipeline.
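
To make this concrete, below is a minimal sketch of the general pattern such an encoder follows: per-frame visual features and audio features are projected to a shared width, fused into one token sequence, and passed through a temporal transformer to produce a single context-aware video embedding. Marengo's actual architecture is not public, so every dimension, module choice, and name here is an illustrative assumption.

```python
# Illustrative sketch only: not Marengo's actual architecture.
# Hypothetical fusion of per-frame visual features and audio features
# into one context-aware video embedding via a temporal transformer.
import torch
import torch.nn as nn

class VideoNativeEncoder(nn.Module):
    def __init__(self, vis_dim=768, aud_dim=512, d_model=512, n_layers=4):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)  # frame features -> shared width
        self.aud_proj = nn.Linear(aud_dim, d_model)  # audio features -> shared width
        self.pos = nn.Parameter(torch.zeros(1, 1024, d_model))  # temporal positions
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, frame_feats, audio_feats):
        # frame_feats: (B, T_v, vis_dim); audio_feats: (B, T_a, aud_dim)
        tokens = torch.cat(
            [self.vis_proj(frame_feats), self.aud_proj(audio_feats)], dim=1
        )
        tokens = tokens + self.pos[:, : tokens.size(1)]
        ctx = self.temporal(tokens)  # models temporal relationships across modalities
        return ctx.mean(dim=1)       # pooled video-level embedding

encoder = VideoNativeEncoder()
video_emb = encoder(torch.randn(2, 16, 768), torch.randn(2, 8, 512))
print(video_emb.shape)  # torch.Size([2, 512])
```

The mean-pooled output stands in for the video-level embedding that downstream search and alignment would consume.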

Relevant Achievements

  1. ViSeRet: 2021.11 - A simple yet effective video retrieval approach that won 1st place in the 2021 ICCV VALUE Challenge (beating Microsoft, Kakao Brain, and ByteDance)

  2. Marengo-2.5 Release: 2023.04

  3. Marengo-2.6 Technical Blog: 2024.04 - A video foundation model for any-to-any search with SOTA performance

The Power of Alignment

Reasoning: Inducing the perceptual reasoning capability through video and language alignment

True video understanding requires the ability to reason about what is perceived. This is where our video-language model, Pegasus, comes into play. Pegasus merges the reasoning skills learned by large language models (from text data) with the perceptual understanding gained from our video encoder model (from video data). By aligning these two modalities, Pegasus can perform cross-modal reasoning, inferring meaning and intent from Marengo's rich, multimodal representations.
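
One common alignment pattern, sketched below on the assumption that it resembles how a video-language model consumes encoder outputs (Pegasus's actual design is not detailed here), is to compress variable-length video features into a fixed set of "video tokens" and project them into the language model's embedding space, where they can sit alongside text tokens:

```python
# Illustrative sketch only: not Pegasus's actual alignment design.
# A learned adapter compresses video encoder outputs into a fixed number
# of "video tokens" that live in the language model's embedding space.
import torch
import torch.nn as nn

class VideoLanguageAdapter(nn.Module):
    def __init__(self, video_dim=512, llm_dim=4096, n_tokens=32):
        super().__init__()
        # Learnable queries attend over the video features and pool them
        # into exactly n_tokens outputs, regardless of video length.
        self.queries = nn.Parameter(torch.randn(n_tokens, video_dim))
        self.attn = nn.MultiheadAttention(video_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(video_dim, llm_dim)  # map into the LLM token space

    def forward(self, video_feats):
        # video_feats: (B, T, video_dim) from the video encoder
        q = self.queries.unsqueeze(0).expand(video_feats.size(0), -1, -1)
        pooled, _ = self.attn(q, video_feats, video_feats)
        return self.proj(pooled)  # (B, n_tokens, llm_dim)

adapter = VideoLanguageAdapter()
video_tokens = adapter(torch.randn(2, 64, 512))
print(video_tokens.shape)  # torch.Size([2, 32, 4096]); prepend to text embeddings
```

Training such an adapter on paired video-text data is what lets the language model's reasoning machinery operate directly on perceptual representations.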

Relevant Achievements

  1. Pegasus-1-Alpha (80B): 2023.08 - The world’s first deployed video-language model (released before OpenAI’s GPT-4V)

  2. Pegasus-1-Beta (17B): 2024.03 - A state-of-the-art video-language model that outperforms Gemini 1.5

The synergy between Marengo and Pegasus is the key to inducing perceptual reasoning capabilities in our AI systems. By leveraging the strengths of both models, we can develop systems that not only perceive and understand the visual world but also reason about it in a way that resembles human cognition.
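
As a toy illustration of that division of labor, the pipeline below uses an embedding-based search step (perception) to pick the most relevant clip and a generation step (reasoning) to answer a question about it. Every function is a runnable placeholder; none of this is the actual Twelve Labs API or either model's interface.

```python
# Toy perceive-then-reason pipeline with placeholder components.
import numpy as np

def embed(item: str) -> np.ndarray:
    # Placeholder for a shared embedding space (text or video in, vector out).
    rng = np.random.default_rng(abs(hash(item)) % 2**32)
    return rng.standard_normal(512)

def retrieve(question: str, clips: list[str]) -> str:
    # Perception step: rank indexed clips by cosine similarity to the query.
    q = embed(question)
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(clips, key=lambda c: cos(q, embed(c)))

def generate_answer(clip: str, question: str) -> str:
    # Placeholder for the video-language model's grounded generation.
    return f"[answer about {clip} for: {question!r}]"

question = "Why does the goalkeeper leave the box?"
best_clip = retrieve(question, ["match_1.mp4", "match_2.mp4"])
print(generate_answer(best_clip, question))
```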

At Twelve Labs, we are committed to pushing the boundaries of intelligence with our focus on perceptual reasoning. Our research is not just about developing state-of-the-art models, but about fundamentally rethinking how AI systems can learn and reason about the world. Join us on this exciting journey as we pioneer the future of video understanding and unlock the full potential of artificial intelligence.
Recognition

Below are highlights of the team's past experience.

Our science team has worked on video and language throughout their careers, accumulating 5+ wins in global competitions and 100+ publications at top AI conferences on video and language.

Perception & Reasoning
  1. ECCV 2018 paper - 1st place in the LSMDC challenge at ICCV 2017

  2. ECCV 2020 paper - 1st place in the LSMDC challenge at ICCV 2019

  3. CVPR 2020 workshop paper - 1st place in the Fashion IQ challenge

  4. arXiv technical report - 1st place in the video retrieval track of the ICCV VALUE Challenge 2021