Reasoning: Inducing perceptual reasoning through video-language alignment
True video understanding requires the ability to reason about what is perceived. This is where our video-language model, Pegasus, comes into play. Pegasus merges the reasoning skills learned from large language models (text data) with the perceptual understanding gained from our video encoder model (video data). By aligning these two modalities, Pegasus can perform cross-modal reasoning, inferring meaning and intent from Marengo's rich, multimodal representations.
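This post does not detail Pegasus's internals, so the sketch below illustrates one common alignment recipe for models of this kind: projecting video-encoder tokens into the language model's embedding space so both modalities share a single sequence the LLM can reason over. Every class name, dimension, and tensor shape here is an illustrative assumption, not the actual Pegasus architecture.

```python
import torch
import torch.nn as nn

class VideoLanguageAligner(nn.Module):
    """Hypothetical sketch of video-language alignment: map video-encoder
    embeddings into an LLM's token-embedding space so the LLM can attend
    over video and text together. Names and sizes are illustrative only."""

    def __init__(self, video_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # A small MLP projects each video token into the LLM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(video_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, video_tokens: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        # video_tokens: (batch, n_video_tokens, video_dim) from the video encoder
        # text_embeds:  (batch, n_text_tokens, llm_dim) from the LLM's embedding table
        projected = self.projector(video_tokens)  # (batch, n_video_tokens, llm_dim)
        # Prefix the projected video tokens to the text sequence; the LLM
        # then attends across both modalities and reasons in language space.
        return torch.cat([projected, text_embeds], dim=1)

# Toy usage with random tensors standing in for real encoder outputs.
aligner = VideoLanguageAligner()
video = torch.randn(1, 256, 1024)  # e.g., 256 tokens from a Marengo-like video encoder
text = torch.randn(1, 32, 4096)    # 32 prompt tokens already embedded by the LLM
fused = aligner(video, text)
print(fused.shape)                 # torch.Size([1, 288, 4096])
```

In recipes like this, the projector (and sometimes the LLM itself) is trained on paired video-text data, which lets the system inherit the language model's reasoning while grounding it in the video encoder's perception.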
Relevant Achievements
Pegasus-1-Alpha (80B): 2023.08 - The world's first video-language model deployment (before the release of OpenAI's GPT-4V)
Pegasus-1-Beta (17B): 2024.03 - A state-of-the-art video-language model that outperforms Gemini 1.5
The synergy between Marengo and Pegasus is the key to inducing perceptual reasoning capabilities in our AI systems. By leveraging the strengths of both models, we can develop systems that not only perceive and understand the visual world but also reason about it in a way that resembles human cognition.
At Twelve Labs, we are committed to pushing the boundaries of intelligence with our focus on perceptual reasoning. Our research is not just about developing state-of-the-art models, but about fundamentally rethinking how AI systems can learn and reason about the world. Join us on this exciting journey as we pioneer the future of video understanding and unlock the full potential of artificial intelligence.