
Research-first and human-centered.

At TwelveLabs, we’re developing video-native AI systems that can solve problems with human-level reasoning. Helping machines learn about the world — and enabling humans to retrieve, capture, and tell their visual stories better.

Our brains continually process sensory input – helping us understand what has happened and predict what might happen next. This ability, known as perceptual reasoning, forms the basis of human intelligence.

AI, as deployed so far, has bypassed a crucial learning step: building a robust representation of the world from video – the kind of input that most closely resembles the sensory stream from which human perception arises.

At TwelveLabs, we’re bridging this gap by training cutting-edge foundation models to learn rich, multimodal representations from video data, then using these representations for high-level reasoning tasks involving language.

Through video-native AI, we’re helping machines learn about the world – and enabling humans to retrieve, capture, and tell their visual stories better. 

The Art of Detail

Perception: Capturing the sensory details through a video-native encoder

Our video-native encoder model, Marengo, is the embodiment of perception. Just as the human sensory organs excel at capturing the world's visual and auditory details, Marengo analyzes visual frames and their temporal relationships, along with speech and sound – ensuring a thorough understanding of everything seen and heard.

This context-aware, video-native representation encoder serves as the foundation for our perceptual reasoning pipeline.
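
To make the idea concrete, here is a minimal, hypothetical sketch of what a video-native encoder of this kind could look like: per-modality features for frames, audio, and speech are projected into a shared space and fused by a temporal transformer into one context-aware video embedding. The sketch is written in PyTorch with made-up dimensions and module names purely for illustration; it is not Marengo's actual architecture.

```python
# Illustrative sketch only: a toy video-native encoder that fuses visual,
# audio, and speech features into one joint embedding. All sizes and names
# are assumptions, not TwelveLabs' implementation.
import torch
import torch.nn as nn


class VideoNativeEncoder(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, n_layers: int = 4):
        super().__init__()
        # Per-modality projections into a shared embedding dimension.
        self.frame_proj = nn.Linear(1024, d_model)   # e.g. per-frame visual features
        self.audio_proj = nn.Linear(128, d_model)    # e.g. log-mel audio features
        self.speech_proj = nn.Linear(768, d_model)   # e.g. transcribed-speech features
        # Temporal transformer over the interleaved multimodal sequence.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, frames, audio, speech):
        # frames: (B, Tf, 1024), audio: (B, Ta, 128), speech: (B, Ts, 768)
        tokens = torch.cat(
            [self.frame_proj(frames), self.audio_proj(audio), self.speech_proj(speech)],
            dim=1,
        )
        contextual = self.temporal(tokens)  # context-aware multimodal tokens
        return contextual.mean(dim=1)       # one pooled video embedding


if __name__ == "__main__":
    enc = VideoNativeEncoder()
    emb = enc(torch.randn(1, 16, 1024), torch.randn(1, 32, 128), torch.randn(1, 8, 768))
    print(emb.shape)  # torch.Size([1, 512])
```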


The Power of Alignment

Reasoning: Inducing the perceptual reasoning capability through video and language alignment

True video understanding requires the ability to reason about what is perceived. This is where our video-language model, Pegasus, comes into play. 

Pegasus merges the reasoning skills learned from large language models (text data) with the perceptual understanding gained from our video encoder model (video data). By aligning these two modalities, Pegasus can perform cross-modal reasoning, inferring meaning and intent from Marengo's rich, multimodal representations.

It’s the synergy between Marengo and Pegasus – the alignment of video and language – that enables perceptual reasoning capabilities in our AI systems. Building on the strengths of both models, we can develop systems that not only perceive and understand the visual world, but also reason about it in a way that resembles human cognition.
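
As a rough illustration of what aligning the two modalities can mean in practice, the hypothetical sketch below maps a pooled video embedding into a short sequence of tokens in a language model's input space, so a text decoder could attend to them alongside word embeddings. Every dimension, module, and name here is an assumption made for the example; this is not how Pegasus is actually built.

```python
# Illustrative sketch only: a toy video-language adapter that projects a
# video embedding into a language model's input space. Not Pegasus itself.
import torch
import torch.nn as nn


class VideoLanguageAdapter(nn.Module):
    def __init__(self, video_dim: int = 512, lm_dim: int = 2048, n_video_tokens: int = 8):
        super().__init__()
        self.n_video_tokens = n_video_tokens
        # Expand one pooled video embedding into a short sequence of
        # "video tokens" living in the language model's embedding space.
        self.proj = nn.Linear(video_dim, lm_dim * n_video_tokens)

    def forward(self, video_emb: torch.Tensor) -> torch.Tensor:
        # video_emb: (B, video_dim) -> (B, n_video_tokens, lm_dim)
        b = video_emb.shape[0]
        return self.proj(video_emb).view(b, self.n_video_tokens, -1)


if __name__ == "__main__":
    adapter = VideoLanguageAdapter()
    video_emb = torch.randn(2, 512)     # e.g. output of the video encoder above
    video_tokens = adapter(video_emb)   # would be prepended to text token embeddings
    print(video_tokens.shape)           # torch.Size([2, 8, 2048])
```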

Recognition

Our science team has worked on video and language research throughout their careers, with 5+ wins in global competitions and 100+ publications at top AI conferences.

ECCV · ICLR

Rethinking how an AI thinks.

We’re not just developing state-of-the-art models — we’re rethinking how AI systems learn and reason. Explore our publications to learn more about our research and discoveries.

Perception & Reasoning

Perception & Reasoning

Cover image

Ready to see your video differently?

Try your own video in our Playground to see next-level video intelligence in action.