Author
Lucas Lee, Kilian Baek, James Le
Date Published
September 22, 2024
Tags
Embeddings
Foundation models
Multimodal AI
Research
Video understanding
Open Source
Evaluation
Check out the TWLV-I technical report on arXiv and Hugging Face! The code is available on GitHub: https://github.com/twelvelabs-io/video-embeddings-evaluation-framework


TLDR

  • Comprehensive Evaluation Framework: Twelve Labs introduces a robust evaluation framework for video understanding, emphasizing both appearance and motion analysis.
  • TWLV-I Model: Our new video foundation model (TWLV-I) excels in balancing these dual aspects, outperforming or matching state-of-the-art models.
  • Task Performance: TWLV-I shows strong results across action recognition, temporal action localization, and spatio-temporal action localization, highlighting its versatility.
  • Visualization Insights: t-SNE and LDA visualizations reveal TWLV-I's superior clustering and motion distinguishability capabilities compared to other models.
  • Future Directions: Future work will focus on scaling up the model, improving image embedding, and expanding modalities to enhance TWLV-I's versatility and applicability.
  • Guiding Future Research: The proposed methodologies aim to set a new standard in video understanding, guiding future research and development in the field.


1 - Introduction

In today's digital landscape, video serves as a universal language, seamlessly conveying complex ideas and emotions across cultures. Developing robust video understanding systems is crucial for accurately interpreting this rich medium. Videos, as sequences of images, require a dual focus: recognizing the appearance in each frame and understanding the motion that unfolds over time.

At Twelve Labs, we recognize the need for a comprehensive evaluation framework that addresses these dual aspects of video understanding. Our goal is to set a clear direction for future research in this field by establishing methodologies that accurately measure both appearance and motion capabilities.


1.1 - Foundation Models and Video Understanding

Foundation Models (FMs) have revolutionized AI by enabling models to handle diverse tasks within a domain. While Language and Image Foundation Models have made significant strides, video understanding presents unique challenges. Existing Video Foundation Models often fall short in effectively capturing both appearance and motion, as evidenced by limitations in clustering and classification tasks.


1.2 - Introducing TWLV-I and Our Evaluation Framework

To address these challenges, we introduce TWLV-I, a model designed to excel in both appearance- and motion-based tasks, as shown in Figure 1. More importantly, we propose a robust evaluation framework that not only assesses TWLV-I's capabilities but also sets a benchmark for future models.

Our framework includes a range of tasks such as action recognition, temporal action localization, and spatio-temporal action localization, each meticulously designed to evaluate specific aspects of video understanding. Through this comprehensive approach, we aim to highlight the importance of balanced capabilities and guide the field towards more holistic model development.

By focusing on both the performance and proper evaluation of video models, we aspire to present a milestone in video research, paving the way for future advancements and innovations in this dynamic field.


2 - TWLV-I & Video Foundation Model Evaluation Framework

TWLV-I's architecture and training process are designed to address the unique challenges of video understanding, balancing the need for both appearance and motion comprehension. Figure 2 provides a visual comparison of TWLV-I's performance on appearance-centric (Kinetics-400) versus motion-centric (SSv2 and Diving-48) benchmarks. The plot demonstrates TWLV-I's balanced capabilities, handling both types of tasks reasonably well.

The following subsections cover the key aspects of TWLV-I's architecture, pretraining data, training objective, and frame sampling technique.


2.1 - Architecture

TWLV-I is built upon the Vision Transformer (ViT) architecture, leveraging its powerful ability to process visual data. We implement two variants:

  1. ViT-B (Base): A model with 86 million parameters
  2. ViT-L (Large): An expanded version with 307 million parameters

The input video is tokenized into multiple patches, which are then processed through the transformer layers. This process yields patch-wise embeddings that are subsequently pooled to generate the overall embedding for the input video.
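To make the tokenize-then-pool flow concrete, here is a minimal, illustrative sketch (not the actual TWLV-I implementation): each frame is split into 16x16 patches, each patch becomes a token, the tokens pass through transformer layers, and the patch-wise embeddings are mean-pooled into a single clip embedding. For simplicity the attention here is applied per frame, whereas the real model attends over space and time; all sizes and helper names are assumptions.

import torch
import torch.nn as nn

class TinyVideoViT(nn.Module):
    """Illustrative ViT-style encoder: frames -> patch tokens -> pooled clip embedding."""

    def __init__(self, img_size=224, patch_size=16, dim=768, depth=2, heads=12):
        super().__init__()
        # Each 16x16x3 patch of every frame becomes one token.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, (img_size // patch_size) ** 2, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, video):                      # video: (B, T, 3, H, W)
        b, t, c, h, w = video.shape
        x = self.patch_embed(video.flatten(0, 1))  # (B*T, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)           # (B*T, num_patches, dim)
        x = self.encoder(x + self.pos_embed)       # patch-wise embeddings
        x = x.view(b, t, -1, x.size(-1))           # (B, T, num_patches, dim)
        return x.mean(dim=(1, 2))                  # pooled clip embedding (B, dim)

clip = torch.randn(2, 8, 3, 224, 224)              # 2 clips, 8 frames each
emb = TinyVideoViT()(clip)                         # -> (2, 768)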

2.2 - Pretraining Dataset

To ensure robust and versatile video understanding, TWLV-I is pretrained on a diverse set of datasets, as detailed in Table 1:

  • Video datasets:
    • Kinetics 710 (658K clips)
    • HowTo360K (360K clips, a subset of HowTo100M)
    • WebVid10M (10.73M clips)
  • Image datasets (total of 15M images):
    • COCO (113K images)
    • SBU Captions (860K images)
    • Visual Genome (100K images)
    • CC3M (2.88M images)
    • CC12M (11.00M images)

This combination of video and image datasets enhances TWLV-I's ability to understand both motion dynamics and static visual features.


2.3 - Training Objective

TWLV-I employs a masked modeling schema as its foundational training approach. However, to optimize the model's performance in both motion and appearance understanding, we diversify the reconstruction targets. This strategy aims to create a robust model capable of excelling across various video understanding tasks. The model is trained from scratch using this objective and the aforementioned datasets.
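Because the exact reconstruction targets are not enumerated here, the snippet below is only a generic masked-modeling sketch under assumed details: a random subset of patch tokens is zeroed out, the encoder processes the corrupted sequence, and a small head is trained to reconstruct the original features at the masked positions with an MSE loss.

import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_modeling_loss(encoder, head, tokens, mask_ratio=0.75):
    """tokens: (B, N, D) patch features from a video clip (assumed layout)."""
    b, n, d = tokens.shape
    num_mask = int(n * mask_ratio)
    noise = torch.rand(b, n, device=tokens.device)
    mask_idx = noise.argsort(dim=1)[:, :num_mask]            # indices of masked tokens
    mask = torch.zeros(b, n, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, mask_idx, True)

    corrupted = tokens.masked_fill(mask.unsqueeze(-1), 0.0)  # zero out masked tokens
    predicted = head(encoder(corrupted))                     # reconstruct all positions
    # Loss is computed only on the masked positions, as in standard masked modeling.
    return F.mse_loss(predicted[mask], tokens[mask])

# Minimal usage with stand-in modules (shapes are illustrative):
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), num_layers=2)
head = nn.Linear(256, 256)
loss = masked_modeling_loss(encoder, head, torch.randn(4, 196, 256))
loss.backward()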


2.4 - Frame Sampling

The frame sampling process is crucial due to the computational constraints of the ViT architecture: the cost of self-attention grows quadratically with the number of input tokens. To manage this, we employ a strategic frame sampling technique called Multi-Clip Embedding (shown in Figure 4):

  1. Clip Division: An input video is divided into M clips, each T seconds long.
  2. Frame Selection: N frames are sampled from each clip.
  3. Embedding Generation: This process results in M embeddings per video.
  4. Flexible Representation: The number of embeddings increases proportionally with video length, allowing for variable-length video processing.
  5. Single Embedding Option: When a single embedding is required to represent the entire video, we average the M embeddings.

This approach allows TWLV-I to handle videos of varying lengths efficiently while maintaining the ability to capture both short-term and long-term temporal dynamics.
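The sketch below illustrates the Multi-Clip Embedding procedure, assuming a fixed frame rate and a generic embed_clip callable standing in for the model; the defaults (2-second clips, 8 frames per clip) are ours, chosen for illustration.

import numpy as np

def multi_clip_embeddings(frames, fps, clip_seconds=2.0, frames_per_clip=8, embed_clip=None):
    """frames: (num_frames, H, W, 3) array; returns (M, D) -- one embedding per clip."""
    frames_per_window = int(round(clip_seconds * fps))
    num_clips = max(1, len(frames) // frames_per_window)          # M grows with video length
    embeddings = []
    for m in range(num_clips):
        start, end = m * frames_per_window, (m + 1) * frames_per_window
        # Sample N frames uniformly inside the clip window.
        idx = np.linspace(start, min(end, len(frames)) - 1, frames_per_clip).astype(int)
        embeddings.append(embed_clip(frames[idx]))                # (D,) per clip
    return np.stack(embeddings)

def video_level_embedding(clip_embeddings):
    """Single-vector option: average the M clip embeddings."""
    return clip_embeddings.mean(axis=0)

# Example with a dummy embedder (stand-in for the actual model):
dummy = lambda clip: clip.mean(axis=(0, 1, 2))                    # -> (3,)
video = np.random.rand(150, 224, 224, 3)                          # ~5 s at 30 fps
clips = multi_clip_embeddings(video, fps=30, embed_clip=dummy)    # (2, 3) here
single = video_level_embedding(clips)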

By combining a powerful ViT architecture with a diverse pretraining dataset and an innovative frame sampling technique, TWLV-I is well-equipped to handle the complexities of video understanding tasks. This foundation enables TWLV-I to perform robustly across both appearance-centric and motion-centric benchmarks, as we will explore in subsequent sections of this evaluation framework.

3 - Action Recognition

Action Recognition (AR) is a fundamental task in video understanding, aiming to classify videos into predefined human action categories. This task serves as a crucial benchmark for evaluating the performance of video foundation models, as it requires both appearance and motion understanding.


3.1 - Benchmarks

We evaluate TWLV-I's performance on five representative AR benchmarks, each with its unique characteristics:

  1. Kinetics-400 (K400): A large-scale dataset focusing on appearance-based actions
  2. Something-Something-v2 (SSv2): A motion-centric dataset emphasizing temporal relationships
  3. Moments-in-Time (MiT): Another appearance-focused dataset with diverse action categories
  4. Diving-48 (DV48): A fine-grained, motion-centric dataset of diving actions
  5. Epic-Kitchens (EK): An egocentric dataset capturing daily kitchen activities

These benchmarks provide a comprehensive evaluation landscape, allowing us to assess TWLV-I's capabilities across both appearance-focused and motion-centric scenarios.


3.2 - Evaluation Methodology

We employ a multi-view classification method, which is standard in action recognition tasks:

  1. Spatial resizing of input videos to match the required input resolution
  2. Uniform sampling of m spatial crops and n temporal clips, resulting in m × n views in total
  3. Calculation of class probabilities for each clip
  4. Averaging of probabilities to obtain the final output

This approach ensures a thorough evaluation of the model's performance across different spatial and temporal segments of the input videos.
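As a hedged illustration of this protocol, the helper below averages per-view class probabilities; the model and the way views are sampled are placeholders rather than the exact evaluation code.

import torch

@torch.no_grad()
def multi_view_predict(model, views):
    """views: (m * n, T, 3, H, W) -- spatial crops x temporal clips of one video."""
    logits = model(views)                  # (m * n, num_classes)
    probs = logits.softmax(dim=-1)         # per-view class probabilities
    return probs.mean(dim=0)               # averaged into one prediction per video

# Toy usage: 3 spatial crops x 4 temporal clips, 400 classes, dummy model.
dummy_model = lambda x: torch.randn(x.shape[0], 400)
views = torch.randn(3 * 4, 8, 3, 224, 224)
video_probs = multi_view_predict(dummy_model, views)
pred_class = video_probs.argmax().item()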


3.3 - Linear Probing

Linear probing is a technique used to evaluate the quality of learned representations without fine-tuning the entire model. It involves:

  1. Freezing the feature extractor (backbone model)
  2. Training a linear classifier on top of the frozen features

The linear classifier consists of a weight matrix that maps the embedding vector dimension to the number of action classes.
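A minimal linear-probing sketch consistent with this description: the backbone is frozen and only a linear layer mapping the embedding dimension to the number of classes is trained (the optimizer and hyperparameters are assumptions).

import torch
import torch.nn as nn

def linear_probe(backbone, train_loader, embed_dim, num_classes, epochs=10, lr=1e-3):
    backbone.eval()                                   # freeze the feature extractor
    for p in backbone.parameters():
        p.requires_grad_(False)

    classifier = nn.Linear(embed_dim, num_classes)    # weight matrix: dim -> classes
    opt = torch.optim.AdamW(classifier.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for clips, labels in train_loader:            # clips: (B, T, 3, H, W)
            with torch.no_grad():
                feats = backbone(clips)               # frozen clip-wise embeddings (B, D)
            loss = ce(classifier(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return classifier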

Results and Analysis

Table 3 presents the linear probing results across various benchmarks and models. Key observations include:

  • TWLV-I demonstrates strong performance across all benchmarks, particularly within its architecture scale (ViT-B and ViT-L).
  • For SSv2, TWLV-I outperforms larger models like ViT-H (DFN) and ViT-g (InternVideo2), with the exception of V-JEPA, which includes SSv2 in its pretraining.
  • TWLV-I's ViT-L variant surpasses larger-scale models on EK and DV48 benchmarks.

These results highlight TWLV-I's ability to learn rich, generalizable features that transfer well to various action recognition tasks, even when evaluated using a simple linear classifier.


3.4 - Attentive Probing

While linear probing provides insights into clip-wise embedding quality, it may not fully capture the models' capabilities, especially for those trained with patch-level supervision. To address this limitation, we introduce attentive probing, which involves:

  1. Training a single attention layer with a learnable class token on top of the frozen model
  2. Passing the output class token through a linear classifier
  3. Measuring the top-1 accuracy

This method allows for a more nuanced evaluation of the models' patch-wise representation capabilities.
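The sketch below mirrors this setup: a learnable class token attends over the frozen patch-wise features through a single attention layer, and the pooled token is passed to a linear classifier. Head sizes and the number of attention heads are assumptions.

import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    """Learnable class token attends over frozen patch embeddings; a linear head classifies."""

    def __init__(self, embed_dim, num_classes, num_heads=8):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_tokens):                   # (B, num_patches, D), backbone frozen
        query = self.cls_token.expand(patch_tokens.size(0), -1, -1)
        pooled, _ = self.attn(query, patch_tokens, patch_tokens)  # (B, 1, D)
        return self.head(pooled.squeeze(1))            # logits (B, num_classes)

probe = AttentiveProbe(embed_dim=768, num_classes=400)
logits = probe(torch.randn(4, 8 * 196, 768))           # 8 frames x 196 patches per clip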

Results and Analysis

Table 5 showcases the attentive probing results. Key findings include:

  • TWLV-I achieves superior performance across both appearance- and motion-centric benchmarks compared to other models.
  • The strong performance in attentive probing suggests that TWLV-I possesses richer patch-wise representations compared to its counterparts.


3.5 - K-Nearest Neighbors

To provide a parameter-free evaluation of embedding quality and address potential biases in gradient-based methods, we employ the K-Nearest Neighbors (KNN) classification task. This non-parametric approach offers a fair comparison of embedding vectors across different model architectures.

We generate embeddings using two methods:

  1. Uniform Embedding: A single embedding vector for the entire video.
  2. Multi-Clip Embedding: Multiple embeddings from 2-second clips throughout the video.

For Multi-Clip Embedding, we employ two evaluation strategies (see the sketch below):

  • Video-level: Averaging all clip embeddings to create a single video representation.
  • Clip-level: Summing votes received by each clip to determine the final class.
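For reference, a hedged sketch of both strategies using scikit-learn's KNN classifier; the value of k, the distance metric, and the toy embeddings are assumptions rather than the exact evaluation settings.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_video_level(train_embs, train_labels, test_clip_embs, k=5):
    """train_embs: (N, D); test_clip_embs: list of (M_i, D) arrays, one per test video."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(train_embs, train_labels)
    test_videos = np.stack([clips.mean(axis=0) for clips in test_clip_embs])  # average clips
    return knn.predict(test_videos)

def knn_clip_level(train_embs, train_labels, test_clip_embs, k=5, num_classes=400):
    knn = KNeighborsClassifier(n_neighbors=k).fit(train_embs, train_labels)
    preds = []
    for clips in test_clip_embs:
        # Each clip retrieves its k nearest training samples; their labels are its votes.
        _, idx = knn.kneighbors(clips)                              # (M_i, k)
        votes = np.bincount(train_labels[idx].ravel(), minlength=num_classes)
        preds.append(votes.argmax())                                # sum votes across clips
    return np.array(preds)

# Toy usage with random embeddings (stand-ins for real model outputs):
train = np.random.rand(100, 64)
labels = np.random.randint(0, 10, size=100)
test = [np.random.rand(3, 64), np.random.rand(5, 64)]
print(knn_video_level(train, labels, test), knn_clip_level(train, labels, test, num_classes=10))
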
Results and Analysis

Table 6 presents the KNN classification results across different models and embedding strategies. Key observations include:

  • TWLV-I demonstrates competitive performance on both Kinetics-400 (K400) and Something-Something-v2 (SSv2) datasets, particularly when compared to models of similar scale.
  • In the Uniform embedding setting, TWLV-I-ViT-B achieves 57.51% and 19.82% top-1 accuracy on K400 and SSv2 respectively, outperforming UMT_s2 with the same architecture.
  • TWLV-I-ViT-L shows strong results, with 65.97% and 19.47% top-1 accuracy on K400 and SSv2, surpassing several larger models.

However, unlike the attentive probing results, TWLV-I underperforms InternVideo2 on both K400 and SSv2 in the KNN evaluation. This suggests two things:

  1. There is room for improvement in how TWLV-I's embeddings are utilized in a non-parametric manner.
  2. Further research is needed on effectively representing embeddings for longer videos, especially those extending beyond a single clip length.

These results highlight the complexity of video representation and the challenges in creating a universally strong embedding across different evaluation methodologies.


3.6 - Pre-Training with SSv2

To ensure a fair comparison with models like V-JEPA and to assess the impact of including motion-centric data in pretraining, we conducted an additional experiment by incorporating the Something-Something-v2 (SSv2) dataset into TWLV-I's pretraining phase.

We trained a variant of TWLV-I (ViT-L architecture) including SSv2 in its pretraining dataset alongside the original datasets. This allows us to directly compare the impact of this additional data on various benchmarks and evaluation methods.


Results and Analysis

The results of this SSv2-enhanced model are presented at the bottom of Tables 3, 5, and 6. Key findings include:

  1. Improved SSv2 Performance: Across all evaluation methods (linear probing, attentive probing, and KNN), the model shows significant improvements on the SSv2 benchmark. For example, in linear probing (Table 3), the top-1 accuracy on SSv2 increases from 46.41% to 48.14%.
  2. General Performance Boost: The inclusion of SSv2 in pretraining leads to performance gains across other benchmarks as well. This suggests that the additional motion-centric data enhances the model's overall video understanding capabilities.
  3. Comparison with Specialized Models: Despite the improvements, the KNN performance of our SSv2-enhanced model still lags behind V-JEPA and InternVideo2 on SSv2. This indicates that while TWLV-I's patch-level representations are strong, there's room for improvement in video-level representations.
  4. Trade-offs in Representation: The results highlight a potential imbalance between TWLV-I's patch-level and video-level representations. Addressing this discrepancy should be a key focus for future enhancements to the model.

These findings underscore the importance of diverse pretraining data in developing robust video foundation models. They also point to specific areas where TWLV-I can be improved to better compete with specialized models on motion-centric tasks while maintaining its strong performance on appearance-based benchmarks.


4 - ImageNet Classification

To assess the versatility of video foundation models and their potential to function as general vision models, we evaluated TWLV-I's performance on the ImageNet classification task. This benchmark provides insights into the model's ability to process and understand static images, which is crucial for a comprehensive video understanding system.


4.1 - Results and Analysis

The ImageNet classification results are presented in the last columns of Table 3 (linear probing) and Table 5 (attentive probing). Key observations include:

  1. Correlation with Appearance-Centric Tasks: Generally, models that excel in appearance-centric action recognition tasks (e.g., Kinetics-400) also demonstrate strong performance on the ImageNet benchmark. This correlation suggests that the features learned for video understanding can transfer effectively to static image classification.
  2. TWLV-I's Performance:
    • In linear probing (Table 3), TWLV-I-ViT-L achieves 72.98% Top-1 accuracy on ImageNet, which is competitive but not leading.
    • For attentive probing (Table 5), TWLV-I-ViT-L shows improved performance with 79.19% Top-1 accuracy, indicating better utilization of learned features through attention mechanisms.
  3. Comparison with Specialized Models: While TWLV-I shows strong performance on video-based tasks, there's a noticeable gap in ImageNet classification compared to some specialized models. For instance, InternVideo2 significantly outperforms TWLV-I on ImageNet, although this gap narrows or even reverses in video action recognition tasks.


4.2 - Implications and Insights

From these observations, we draw the following insights:

  1. Static Image Processing Limitations: The results suggest that TWLV-I has some limitations in processing static images compared to models optimized for this task. This indicates an area for potential improvement in future iterations of the model.
  2. Motion Information Utilization: The narrowing performance gap between TWLV-I and other models in video tasks, despite lower ImageNet scores, suggests that TWLV-I effectively leverages motion information for video understanding. This ability complements its appearance-based features, resulting in strong overall video analysis performance.
  3. Balancing Video and Image Capabilities: To evolve TWLV-I beyond being solely a video model, future research should focus on enhancing its ability to understand single images without compromising its strong video analysis capabilities.

Overall, while TWLV-I demonstrates strong capabilities in video understanding tasks, the ImageNet classification results highlight areas for improvement in static image processing. Addressing these limitations could lead to a more comprehensive visual foundation model capable of excelling across both video and image domains.


5 - Temporal Action Localization

Temporal Action Localization (TAL) is a critical task in video understanding that involves identifying and localizing specific actions within untrimmed videos. This task is particularly important for applications in autonomous driving, sports analysis, and content-based video retrieval. TAL requires models to analyze long, complex videos and precisely determine both the temporal boundaries and class labels of distinct actions occurring within them.


5.1 - Evaluation Perspectives

TAL evaluates video foundation models from two key perspectives:

  1. Temporal Sensitivity: The ability to identify if an action of interest occurs at a specific time step.
  2. Instance Discrimination: The capability to distinguish or group frame-wise segments into complete action instances.

While TAL is primarily designed as a motion-centric task, our analysis reveals that both appearance and motion capabilities contribute to achieving these aspects effectively.


5.2 - Methodology

We evaluated TWLV-I and other video foundation models on two prominent TAL datasets:

  1. ActivityNet-v1.3
  2. THUMOS14

For the detection head, we employed ActionFormer. Our evaluation used two distinct methods:

  1. Self-contained: The model performs both classification and regression without external assistance.
  2. w/ External Classifier: The model conducts binary classification, while an external classifier predicts the action class.

Feature extraction followed the Multi-Clip Embedding method described in the "Frame Sampling" section.
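Because the TAL results below are mAP values computed at temporal IoU thresholds, the small helper below shows the underlying 1D IoU between a predicted and a ground-truth segment; it is a building block for the metric, not the ActionFormer pipeline itself.

def temporal_iou(pred, gt):
    """pred, gt: (start, end) in seconds. Returns intersection-over-union of the two segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction matches a ground-truth instance at threshold t if IoU >= t (e.g. 0.5 or 0.95).
print(temporal_iou((10.0, 20.0), (12.0, 22.0)))   # 8 / 12 = 0.667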


5.3 - Results and Analysis

Tables 9 and 10 present the comprehensive results for THUMOS14 and ActivityNet-v1.3, respectively.

THUMOS14 (Table 9)

  • TWLV-I (our model) consistently outperforms other models of the same scale across all metrics.
  • TWLV-I-ViT-L achieves the highest average mAP of 58.75% in the self-contained setting and 53.63% with an external classifier.
  • Notably, TWLV-I-ViT-L surpasses larger models like DFN and V-JEPA (ViT-H), demonstrating its robust generalization capabilities.

ActivityNet-v1.3 (Table 10)

  • TWLV-I again shows superior performance within its architecture scale.
  • TWLV-I-ViT-L achieves an average mAP of 34.98% (self-contained) and 39.49% (w/ external classifier).
  • Remarkably, TWLV-I performs exceptionally well at strict IoU thresholds (e.g., 0.95), indicating strong temporal sensitivity.


5.4 - Key Insights

From these observations, we draw the following insights:

  1. Scale Efficiency: TWLV-I, especially at the ViT-L scale, outperforms larger models, highlighting the efficiency of its architecture and training methodology.
  2. Temporal Precision: High performance at strict IoU thresholds (e.g., 0.95 in ActivityNet) demonstrates TWLV-I's superior temporal sensitivity.
  3. Model Scale Impact: InternVideo2's top performance, particularly in self-contained evaluation, suggests that TAL benefits from larger model scales, especially when performing both classification and boundary regression simultaneously.
  4. Evaluation Strategy: For a focused assessment of motion understanding, using an external classifier is recommended. However, for a comprehensive evaluation including appearance capabilities, the self-contained approach is preferable.
  5. Generalization Capability: TWLV-I's strong performance across different datasets and evaluation methods indicates its robustness as a backbone for temporal localization tasks.

These results underscore TWLV-I's effectiveness in tackling the complex task of temporal action localization, showcasing its balanced capabilities in both temporal sensitivity and instance discrimination. The model's performance across various scales and evaluation methods demonstrates its potential as a versatile foundation for video understanding tasks.


6 - Spatio-Temporal Action Localization

Spatio-Temporal Action Localization (STAL) is a complex task that evaluates a model's ability to not only recognize actions but also accurately localize them in both space and time within untrimmed videos. This task is crucial for applications requiring detailed action analysis, as it demands a comprehensive understanding of both appearance and motion.

‍

6.1 - Dataset and Evaluation

We evaluate models using the AVA v2.2 dataset, which consists of 430 movie clips, each 15 minutes long. Keyframes are annotated every second, resulting in 210,634 labeled frames in the training set and 57,371 in the validation set. The dataset includes 80 atomic actions labeled for each actor.

For evaluation, we report the Frame Average Precision (fAP) at an Intersection over Union (IoU) threshold of 0.5, using the latest v2.2 annotations. We adopt an end-to-end STAL framework, training the decoder while keeping the vision backbone frozen.
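As with temporal localization, the fAP@0.5 metric rests on an IoU test, here between predicted and annotated actor boxes on each keyframe. The helper below is a generic sketch of that spatial IoU, not the exact evaluation code.

def box_iou(a, b):
    """a, b: (x1, y1, x2, y2) boxes. Returns spatial intersection-over-union."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

# A detection on a keyframe counts as correct for fAP@0.5 if box IoU >= 0.5
# and the predicted action label matches the annotation.
print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))   # 50 / 150 = 0.333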


6.2 - Results and Analysis

Table 11 presents the evaluation results for the STAL task. Key observations include:

  • Comparison with Other Models: TWLV-I, along with UMT and InternVideo2, outperforms DFN and V-JEPA. While DFN and V-JEPA exhibit opposite trends in other tasks, they show similar performance in STAL due to the dual requirements of instance localization and action recognition.
  • Challenges in Instance Localization:
    • V-JEPA: Struggles with instance localization, which impairs its ability to perform action recognition effectively.
    • DFN: Excels in localizing instances due to its strong appearance understanding but fails in action recognition due to limited temporal understanding.
  • Strengths of TWLV-I:
    • TWLV-I demonstrates a balanced understanding of both appearance and motion, allowing it to perform well in STAL.
    • At the ViT-L scale, TWLV-I achieves an fAP@0.5 of 27.39, outperforming both DFN and V-JEPA and closely matching the performance of larger models like InternVideo2.


6.3 - Key Insights

From these observations, we draw the following insights:

  1. Balanced Understanding: The results confirm that TWLV-I effectively understands spatio-temporal information, crucial for accurate STAL.
  2. Comprehensive Evaluation: The STAL task, conducted in an end-to-end manner, serves as a robust measure of a model's ability to understand video comprehensively, encompassing both spatial and temporal dimensions.
  3. Model Validation: The performance of TWLV-I, alongside UMT and InternVideo2, highlights its capability to handle the dual challenges of localization and recognition, setting it apart from models with a singular focus on appearance or motion.

These findings underscore the importance of a balanced approach to video understanding, where both appearance and motion are integrated to achieve superior performance in complex tasks like STAL.


7 - Temporal Action Segmentation

Temporal Action Segmentation (TAS) is a crucial task for understanding and analyzing human activities in complex, extended videos. It has a wide range of applications, including video surveillance, summarization, and skill assessment. TAS involves processing untrimmed video inputs to generate a sequence of action class labels for each frame.

‍

7.1 - Benchmarks and Methodology

We evaluate TAS using three challenging benchmarks:

  1. 50Salads: Focuses on fine-grained actions in cooking scenarios.
  2. GTEA: Captures egocentric views of daily activities.
  3. Breakfast: Encompasses a broader view, including full-body actions.

For TAS, we employ ASFormer as the detection head. We report the average F1 scores at thresholds of 10, 25, and 50, referred to as 'mF1'. Features are extracted using the Multi-Clip Embedding method and aggregated with spatial pooling to retain only the temporal axis.
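A sketch of this feature preparation under assumed tensor shapes: patch-wise embeddings from each clip are spatially pooled so that only the temporal axis remains, and the per-clip sequences are concatenated into the feature sequence consumed by the ASFormer-style head.

import torch

def temporal_feature_sequence(clip_patch_embeddings):
    """clip_patch_embeddings: list of (T, num_patches, D) tensors, one per clip.

    Spatial pooling removes the patch axis; concatenation along time yields (T_total, D).
    """
    pooled = [clip.mean(dim=1) for clip in clip_patch_embeddings]   # (T, D) per clip
    return torch.cat(pooled, dim=0)                                 # (T_total, D) sequence

# Two 2-second clips of 8 frames each -> a 16-step feature sequence for the TAS head.
clips = [torch.randn(8, 196, 768), torch.randn(8, 196, 768)]
features = temporal_feature_sequence(clips)     # shape: (16, 768)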

‍

7.2 - Results and Analysis

Table 12 presents the TAS evaluation results across the three benchmarks:

  • 50Salads: TWLV-I achieves an mF1 score of 80.69 with ViT-B and 80.60 with ViT-L, outperforming both appearance-centric models like UMT and DFN and the motion-centric model V-JEPA. The Edit and accuracy scores also reflect TWLV-I's strong alignment with ground-truth sequences.
  • GTEA: TWLV-I excels with an mF1 score of 88.26 (ViT-B) and 88.43 (ViT-L), surpassing all baselines. This demonstrates its robust performance in egocentric views, where precise action segmentation is critical.
  • Breakfast: Despite the broader view, TWLV-I maintains strong performance, achieving mF1 scores of 52.18 (ViT-B) and 50.66 (ViT-L). This highlights its versatility across different camera perspectives.


7.3 - Key Insights

From these observations, we draw the following insights:

  1. Balanced Capabilities: TWLV-I's superior performance across all datasets indicates a well-balanced understanding of both appearance and motion, crucial for accurate action segmentation.
  2. Alignment with Ground Truth: High Edit and mF1 scores suggest that TWLV-I's predictions closely match the actual sequences of actions, enhancing its reliability for real-world applications.
  3. Versatility Across Views: The model's ability to outperform baselines in various camera views, including top-down and egocentric perspectives, underscores its adaptability and robustness.
  4. Comparison with InternVideo2: While TWLV-I performs comparably to InternVideo2 in specific views, its consistent excellence across different datasets and perspectives highlights its comprehensive capabilities.

These results affirm the effectiveness of our model in tackling the complex task of temporal action segmentation, showcasing its potential for diverse applications in video analysis.


8 - Embedding Visualization

To gain deeper insights into the representational capabilities of TWLV-I and other video foundation models, we conducted visualization experiments using t-SNE and Linear Discriminant Analysis (LDA). These visualizations provide valuable information about the models' ability to capture both appearance-based and motion-based features.


8.1 - t-SNE Visualization on Benchmark Datasets

We used t-SNE to visualize the embedding spaces of the Kinetics-400 (K400) and Something-Something-v2 (SSv2) validation sets, with different colors representing different action classes.
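For reference, a minimal t-SNE sketch with scikit-learn and matplotlib; the embeddings and labels below are random placeholders standing in for the K400 or SSv2 validation features.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-ins for per-video embeddings and their action-class labels.
embeddings = np.random.rand(500, 768)
labels = np.random.randint(0, 10, size=500)

points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
plt.figure(figsize=(6, 6))
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=8)
plt.title("t-SNE of video embeddings (colors = action classes)")
plt.savefig("tsne_embeddings.png", dpi=150)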

Kinetics-400 (Figure 6)

  • TWLV-I, UMT, and InternVideo2: These models demonstrate clear clustering of samples, with distinct decision boundaries for most classes. This indicates a strong ability to differentiate between various appearance-based actions.
  • V-JEPA: In contrast, V-JEPA fails to cluster the samples into distinct groups, suggesting limitations in capturing appearance-based features effectively.

Something-Something-v2 (Figure 7)

  • All Models: Interestingly, none of the models, including TWLV-I and V-JEPA, show effective clustering of classes in the SSv2 dataset. This aligns with the generally lower classification performance on SSv2 compared to K400.
  • Implications: The lack of clear clustering in SSv2 visualizations highlights the challenges in representing motion-centric actions and underscores the need for further research to improve performance on such benchmarks.


8.2 - Directional Motion Distinguishability

To better understand the models' capabilities in capturing motion information, we focused on specific classes within SSv2 that require understanding directional motion, such as "Moving something up" and "Moving something down".

We visualized embeddings of original videos and their temporally flipped versions using Linear Discriminant Analysis (LDA). This approach allows us to assess whether models can encode directional motion information in their embedding vectors, independent of appearance.
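A hedged sketch of this check: embeddings of original clips and their temporally reversed copies are projected with scikit-learn's LDA, and separability along the single discriminant axis indicates that motion direction is encoded in the embedding. The embedding function and data here are toy placeholders.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def direction_separability(embed_fn, clips):
    """clips: (N, T, H, W, 3) array; embed_fn maps one clip to a 1D embedding vector."""
    forward = np.stack([embed_fn(c) for c in clips])
    reverse = np.stack([embed_fn(c[::-1]) for c in clips])       # temporally flipped copies
    X = np.concatenate([forward, reverse])
    y = np.array([0] * len(forward) + [1] * len(reverse))        # 0 = forward, 1 = reversed

    lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
    projected = lda.transform(X)                                 # 1D axis for plotting
    return projected, lda.score(X, y)                            # higher score => direction encoded

# Toy usage with a motion-sensitive dummy embedder (frame-difference statistics):
dummy = lambda clip: np.diff(clip, axis=0).mean(axis=(1, 2, 3))
clips = np.random.rand(50, 8, 32, 32, 3)
_, acc = direction_separability(dummy, clips)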

Results (Figures 8 and 9)

  • TWLV-I: Demonstrates good separation between forward and reverse videos, indicating effective encoding of directional motion information.
  • V-JEPA: Shows the highest distinguishability between forward and reverse videos, likely due to its pretraining on the SSv2 dataset.
  • UMT and InternVideo2: Provide non-separable embeddings for forward and reverse videos, suggesting a limitation in capturing directional motion information despite their strong appearance understanding capabilities.

8.3 - Key Insights

From these observations, we draw the following insights:

  1. Appearance vs. Motion Understanding: TWLV-I shows a balanced capability in capturing both appearance-based features (evident in K400 visualization) and motion-based features (demonstrated in directional motion analysis).
  2. Challenges in Motion-Centric Tasks: The difficulty in clustering SSv2 classes highlights the complexity of representing motion-centric actions, a challenge faced by all evaluated models.
  3. Directional Motion Encoding: TWLV-I's ability to distinguish between forward and reverse videos indicates its effectiveness in encoding fine-grained motion information, a crucial aspect for comprehensive video understanding.
  4. Model Specialization: The results reveal the strengths and limitations of different models, with some excelling in appearance-based tasks while struggling with motion-centric challenges.

These visualizations provide valuable insights into the representational capabilities of TWLV-I and other video foundation models, highlighting areas of strength and identifying opportunities for further improvement in capturing complex motion dynamics.


9 - Future Directions

As we look ahead, there are several promising avenues for further development and enhancement of TWLV-I and video foundation models in general. These directions aim to expand the model's capabilities, improve its performance, and broaden its applicability across various domains of visual understanding.


9.1 - Scaling Up

Our current focus has been on two primary scales of models: ViT-B and ViT-L. The ViT-L variant of TWLV-I has already demonstrated performance comparable to, and in some cases surpassing, larger models such as ViT-H or ViT-g. This success suggests significant potential for further improvements through scaling:

  1. Increased Model Size: Expanding TWLV-I to even larger architectures could potentially yield substantial performance gains across various tasks.
  2. Large-Scale Datasets: Utilizing extensive, in-house collected datasets for training could lead to the development of more general and robust models. This approach could help in capturing a wider range of visual concepts and scenarios.
  3. Efficient Scaling Techniques: Exploring methods to scale the model efficiently, such as sparse attention mechanisms or mixture-of-experts architectures, could allow for increased model capacity without proportional increases in computational requirements.


9.2 - Enhancing Image Embedding Capabilities

While TWLV-I has shown strong performance in video-related tasks, there's room for improvement in its image embedding capabilities:

  1. Single-Frame Processing: Enhancing TWLV-I's ability to process and understand single-frame inputs (i.e., images) is crucial. This improvement would bridge the gap between video and image understanding, making TWLV-I a more versatile visual foundation model.
  2. Transfer Learning: Developing techniques to effectively transfer knowledge from video understanding to image tasks could lead to improved performance in image-specific applications.
  3. Unified Architecture: Exploring architectural modifications that allow for seamless processing of both videos and images within the same model framework.


9.3 - Expanding Modality

To further enhance TWLV-I's versatility and align it with broader applications in AI, expanding its modality is a key direction:

  1. Multimodal Tasks: Extending TWLV-I to handle multimodal tasks such as video-text retrieval and video captioning. This expansion would require aligning the model with text domains while maintaining its strong unimodal video understanding capabilities.
  2. Video-Language Models: Developing TWLV-I as a more effective vision encoder for video-language models, enabling better integration with language processing tasks.
  3. Audio Integration: Incorporating audio processing capabilities to enable tasks such as audio-visual synchronization and sound event detection in videos.
  4. Evaluation Methodologies: Proposing and standardizing comprehensive evaluation methodologies for these multimodal tasks to ensure thorough assessment and validation of model performance.

By pursuing these future directions, we aim to evolve TWLV-I from a specialized video foundation model to a comprehensive visual understanding system capable of handling a wide range of tasks across different modalities. This evolution will not only enhance the model's performance in existing applications but also open up new possibilities in fields such as human-computer interaction, content creation, and advanced AI assistants.


10 - Conclusion

In this comprehensive exploration of video understanding, we have emphasized the critical importance of both appearance and motion as fundamental elements in comprehending visual content. Our research has led to several key outcomes and insights:

  1. Dual-Aspect Evaluation: We have presented methodologies for measuring both appearance and motion capabilities across various tasks, providing a holistic approach to assessing video foundation models.
  2. Limitations of Existing Models: Through rigorous experiments, we have demonstrated that many existing models struggle to excel in both appearance and motion understanding simultaneously, highlighting a significant gap in the field.
  3. Introduction of TWLV-I: In response to these challenges, we have proposed TWLV-I, a novel approach that demonstrates robust performance in both motion and appearance-based tasks.
  4. Versatile Embeddings: The embeddings generated by TWLV-I show promise for wide-ranging applications and further research in various downstream tasks, potentially accelerating progress in video understanding.
  5. Evaluation Framework: The evaluation and analysis methods introduced in this post offer a comprehensive framework for assessing video foundation models, which we hope will be widely adopted in the field.
  6. Future Directions: Our work points towards potential avenues for future research, including scaling up model size, enhancing image embedding capabilities, and expanding into multimodal tasks.
  7. Guiding the Field: By addressing both appearance and motion aspects, our approach aims to set a new standard and direction for the video understanding domain.

In conclusion, TWLV-I represents a significant step forward in creating more versatile and robust video foundation models. By balancing the capabilities of appearance and motion understanding, it opens up new possibilities for advanced video analysis and comprehension. We anticipate that the methodologies and insights presented here will drive further innovations in video AI and inspire researchers to approach video understanding with a more holistic perspective.

As we move forward, the continued development and refinement of such comprehensive models will be crucial in unlocking the full potential of video understanding technologies across various applications, from content analysis and creation to advanced AI assistants and beyond. We invite the research community to build upon these foundations, pushing the boundaries of what's possible in the realm of video AI.


Twelve Labs Team

This work was achieved through the combined efforts (equal contribution) of the core contributors, with significant support from the Twelve Labs ML Research and ML Data teams.


Core Contributors

Hyeongmin Lee, Research Scientist
Jin-Young Kim, Research Scientist
Kyungjune Baek, Research Scientist
Jihwan Kim, Research Scientist
Aiden Lee, CTO

Contributors (first-name alphabetical order)

Aaron (Jangwon) Lee, ML Data Intern
Calvin (Minjoon) Seo, Chief Scientist
Cooper (Seokjin) Han, Research Scientist
Daniel (GeunOh) Kim, ML Data Engineer
Flynn (Jiho) Jang, Research Scientist
Ian (Soonwoo) Kwon, Research Scientist
Jay Suh, ML Data Engineer
Jay (Jaehyuk) Yi, Research Scientist
Jayden (Junwan) Kim, Research Intern
Jeff (Jongseok) Kim, Research Scientist
Kyle (Seungjoon) Park, Research Scientist
Leo (Daewoo) Kim, Research Scientist
Mars (Seongsu) Ha, Research Scientist
Max (JongMok) Kim, Research Scientist
Ray (Raehyuk) Jung, Research Scientist
William (Hyojun) Go, Research Scientist


Citation

If you think this project is helpful, please feel free to leave a star and cite our paper:

@inproceedings{twelvelabs2024twlv,
  title={TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models},
  author={Twelve Labs},
  year={2024}
}

