Research-first and human-centered.

Our brains continually process sensory input – helping us understand what has happened and predict what might happen next. This ability, known as perceptual reasoning, forms the basis of human intelligence.
AI, as deployed so far, has skipped a crucial learning step: building a robust representation of the world from video, the medium that most closely resembles the sensory input behind human perception.
At TwelveLabs, we’re bridging this gap by training cutting-edge foundation models to learn rich, multimodal representations from video data, then using these representations for high-level reasoning tasks involving language.
Through video-native AI, we’re helping machines learn about the world – and enabling humans to retrieve, capture, and tell their visual stories better.
Our research, recognized.
Our team hails from backgrounds in language, video, machine learning, and perception. Together, they boast 5+ wins in global competitions and 100+ publications at top AI conferences.
The Art of Detail
Perception: Capturing the sensory details through a video-native encoder
Our video-native encoder model, Marengo, is the embodiment of perception. Just as the human sensory organs excel at capturing the world's visual and auditory details, Marengo analyzes visual frames and their temporal relationships, along with speech and sound, ensuring a thorough understanding of both visual and auditory elements.
This context-aware, video-native representation encoder serves as the foundation for our perceptual reasoning pipeline.
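To make the idea of a context-aware, video-native encoder concrete, here is a minimal sketch of one way such a model could be structured: per-frame visual features and audio features are projected into a shared space and fused by a temporal transformer. Every name, dimension, and design choice below is an illustrative assumption, not Marengo's actual architecture.

```python
# Illustrative sketch of a video-native encoder: fuse visual and audio
# features over time into a single video embedding. All module names,
# dimensions, and the fusion strategy are assumptions for illustration only.
import torch
import torch.nn as nn


class VideoNativeEncoder(nn.Module):
    """Fuses per-frame visual features and audio features across time."""

    def __init__(self, visual_dim=1024, audio_dim=128, embed_dim=512, num_layers=4):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.visual_proj = nn.Linear(visual_dim, embed_dim)
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        # A temporal transformer models relationships across the token sequence.
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats, audio_feats):
        # frame_feats: (batch, num_frames, visual_dim), e.g. from an image backbone
        # audio_feats: (batch, num_segments, audio_dim), e.g. from an audio frontend
        tokens = torch.cat(
            [self.visual_proj(frame_feats), self.audio_proj(audio_feats)], dim=1
        )
        tokens = self.temporal_encoder(tokens)  # context-aware multimodal tokens
        return tokens.mean(dim=1)               # one embedding per video


# Example: embed a batch of 2 clips with 16 frames and 8 audio segments each.
encoder = VideoNativeEncoder()
video_embedding = encoder(torch.randn(2, 16, 1024), torch.randn(2, 8, 128))
print(video_embedding.shape)  # torch.Size([2, 512])
```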
THE POWER OF ALIGNMENT
Reasoning: Inducing the perceptual reasoning capability through video and language alignment
True video understanding requires the ability to reason about what is perceived. This is where our video-language model, Pegasus, comes into play.
Pegasus merges the reasoning skills learned from large language models (text data) with the perceptual understanding gained from our video encoder model (video data). By aligning these two modalities, Pegasus can perform cross-modal reasoning, inferring meaning and intent from Marengo's rich, multimodal representations.
It’s the synergy between Marengo and Pegasus, the alignment of video and language, that enables perceptual reasoning capabilities in our AI systems. Building on the strengths of both models, we can develop systems that not only perceive and understand the visual world, but also reason about it in a way that resembles human cognition.
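As a concrete illustration of modality alignment, the sketch below shows one common, generic recipe: a symmetric contrastive objective that pulls paired video and caption embeddings together while pushing mismatched pairs apart. It is a hedged example of the general technique, under assumed tensor shapes and a standard temperature value, not a description of how Pegasus is actually trained.

```python
# Generic video-language alignment via a symmetric contrastive (InfoNCE)
# objective. This illustrates the general idea of aligning two modalities;
# it is not TwelveLabs' training recipe.
import torch
import torch.nn.functional as F


def alignment_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss: matched video/text pairs attract, others repel.

    video_emb, text_emb: (batch, dim) tensors where row i of each is a pair.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Cross-entropy in both directions: video-to-text and text-to-video retrieval.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


# Example with random embeddings standing in for video and text representations.
loss = alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```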
Recognition
Our science team has worked on video and language throughout their careers, earning 5+ wins in global competitions and 100+ publications at top AI conferences on video and language.
Rethinking how an AI thinks.
A Joint Sequence Fusion Model for Video Question Answering and Retrieval
ECCV 2018
1st place in LSMDC challenge in ICCV 2017
AdaCoF: Adaptive Collaboration of Flows for Video Frame Interpolation
CVPR 2020
Character Grounding and Re-Identification in Story of Videos and Text Descriptions
ECCV 2020
1st place in LSMDC challenge in ICCV 2019
Psynet: Self-supervised Approach to Object Localization using Point Symmetric Transformation
AAAI 2020
Cycled Compositional Learning between Images and Text
CVPR 2020 workshop
1st place in Fashion IQ challenge
Dual Compositional Learning in Interactive Image Retrieval
AAAI 2021
Transitional Adaptation of Pretrained Models for Visual Storytelling
CVPR 2021
ViSeRet: A Simple yet Effective Approach to Moment Retrieval via Fine-Grained Video Segmentation
arXiv Technical Report
1st place in video retrieval track in ICCV VALUE Challenge 2021
Rethinking the Truly Unsupervised Image-To-Image Translation
ICCV 2021
Viewpoint-Agnostic Change Captioning With Cycle Consistency
ICCV 2021
Commonality in Natural Images Rescues GANs: Pre-training GANs with Generic and Privacy-Free Synthetic Data
CVPR 2022
A Drone Video Clip Dataset and its Applications in Automated Cinematography
Computer Graphics Forum 2022
BaSSL: Boundary-Aware Self-Supervised Learning for Video Scene Segmentation
ACCV 2022
Self-Distilled Self-Supervised Representation Learning
WACV 2023
Unifying Vision-Language Representation Space with Single-tower Transformer
AAAI 2023
Towards Practical Plug-and-Play Diffusion Models
CVPR 2023
Exploring Discontinuity for Video Frame Interpolation
CVPR 2023
A Nonlinear, Regularized, and Data-independent Modulation for Continuously Interactive Image Processing Network
IJCV 2023
Bidirectional Attention Flow for Machine Comprehension
ICLR 2017
Zero-Shot Relation Extraction via Reading Comprehension
ACL 2017
Neural Speed Reading via Skim-RNN
ICLR 2018
Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index
ACL 2019
Towards Continual Knowledge Learning of Language Models
ICLR 2022
Revisiting Intermediate Layer Distillation for Compressing Language Models: An Overfitting Perspective
ACL 2023

Ready to see your video differently?
Try your own video in our Playground to see next-level video intelligence in action.