TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models

Hyeongmin Lee, Jin-Young Kim, Kyungjune Baek, Jihwan Kim, Hyojun Go, Seongsu Ha, Seokjin Han, Jiho Jang, Raehyuk Jung, Daewoo Kim, GeunOh Kim, JongMok Kim, Jongseok Kim, Junwan Kim, Soonwoo Kwon, Jangwon Lee, Seungjoon Park, Minjoon Seo, Jay Suh, Jaehyuk Yi, Aiden Lee

TL;DR

This work tackles the challenge of fairly evaluating video foundation models whose performance is sensitive to sampling and pretraining choices. It proposes a holistic evaluation framework that separately measures appearance understanding and motion understanding, and introduces TWLV-I, a video foundation model trained on publicly accessible data. Across action recognition, temporal localization, spatio-temporal localization, and segmentation benchmarks, TWLV-I demonstrates robust performance, often matching or surpassing state-of-the-art baselines at the same scale and sometimes beating larger models. The paper also provides embedding vectors and an open-source evaluation framework to facilitate broader benchmarking and future developments in video-language and video understanding tasks.

Abstract

In this work, we discuss evaluating video foundation models in a fair and robust manner. Unlike language or image foundation models, many video foundation models are evaluated with differing parameters (such as sampling rate, number of frames, and pretraining steps), making fair and robust comparisons challenging. Therefore, we present a carefully designed evaluation framework for measuring two core capabilities of video comprehension: appearance and motion understanding. Our findings reveal that existing video foundation models, whether text-supervised like UMT or InternVideo2, or self-supervised like V-JEPA, exhibit limitations in at least one of these capabilities. As an alternative, we introduce TWLV-I, a new video foundation model that constructs robust visual representations for both motion- and appearance-based videos. Pretrained only on publicly accessible datasets, our model improves the average top-1 accuracy of linear probing on five action recognition benchmarks by 4.6%p over V-JEPA (ViT-L) and by 7.7%p over UMT (ViT-L). Even when compared to much larger models, our model demonstrates a 7.2%p improvement over DFN (ViT-H), a 2.7%p improvement over V-JEPA (ViT-H), and a 2.8%p improvement over InternVideo2 (ViT-g). We provide embedding vectors obtained by TWLV-I from videos of several commonly used video benchmarks, along with evaluation source code that can directly utilize these embeddings. The code is available at https://github.com/twelvelabs-io/video-embeddings-evaluation-framework.
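
To make the headline metric concrete, the following is a minimal sketch of how an average top-1 accuracy over several benchmarks and a percentage-point (%p) gap between two models could be computed. The function names, the benchmark keys, and all numbers are hypothetical and chosen for illustration only; an unweighted mean over benchmarks is assumed, and none of this is taken from the released evaluation code.

```python
import numpy as np


def top1_accuracy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of samples whose highest-scoring class equals the label."""
    preds = np.argmax(logits, axis=1)
    return float(np.mean(preds == labels))


def average_top1(per_benchmark_acc: dict) -> float:
    """Unweighted mean of per-benchmark accuracies in [0, 1], returned in percent.

    Assumption: every benchmark counts equally, regardless of its size.
    """
    return 100.0 * float(np.mean(list(per_benchmark_acc.values())))


def gap_in_percentage_points(ours: float, baseline: float) -> float:
    """Absolute difference between two percentages (%p), not a relative change."""
    return ours - baseline


# Toy linear-probe outputs: 4 samples, 2 classes (made-up values).
toy_logits = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
toy_labels = np.array([1, 0, 0, 0])
toy_acc = top1_accuracy(toy_logits, toy_labels)  # 3 of 4 predictions are correct

# Made-up per-benchmark accuracies for two hypothetical models.
model_a = {"bench_1": 0.80, "bench_2": 0.60}
model_b = {"bench_1": 0.70, "bench_2": 0.60}
gap = gap_in_percentage_points(average_top1(model_a), average_top1(model_b))
print(f"toy top-1: {toy_acc:.2f}, gap: {gap:.1f}%p")
```

Reporting the gap in percentage points keeps the comparison absolute: a move from 65% to 70% is 5%p, whereas the relative improvement would be about 7.7%.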

Paper Structure (17 sections, 9 figures, 12 tables)

This paper contains 17 sections, 9 figures, 12 tables.

Figures (9)

  • Figure 1: Comparison with video foundation models at the same scale. TWLV-I transfers to various tasks and performs comparably to, or even better than, the state-of-the-art models. We present the best performance among the competitors, as well as the performance of our model. † denotes a model that uses the downstream dataset in the pretraining stage. All compared models in this figure are at the ViT-L scale.
  • Figure 2: Performance on appearance- vs. motion-centric benchmarks. Our model can handle both appearance- and motion-centric benchmarks reasonably well. † denotes that the pretraining dataset of the model includes the downstream dataset. V-JEPA uses Something-Something-v2 in the pretraining stage. InternVideo2 is pretrained on Moments-in-Time and Something-Something-v2.
  • Figure 3: (a), (b): t-SNE visualizations of embeddings obtained from the K400 validation set using TWLV-I and V-JEPA. (c), (d): LDA visualizations of embeddings from the 'Moving something up' class of the SSv2 validation set and their reversed versions using TWLV-I and InternVideo2. Details can be found in Figures \ref{fig:vis_k400}, \ref{fig:updown_down}, \ref{fig:updown_up}, and Section \ref{sec:embedding_vis}. As seen from (a) and (b), V-JEPA lacks the capability to cluster embeddings of the same class when extracting embeddings from K400, where understanding the visual appearance of each frame is important. From (c) and (d), it is evident that InternVideo2 struggles to distinguish between videos with objects moving in a specific direction and their reversed versions, indicating a limitation in motion understanding capability. In contrast, TWLV-I demonstrates both of these capabilities.
  • Figure 4: Overall architecture of the evaluation framework including TWLV-I. In Multi-Clip Embedding, the video is divided into multiple clips, and an embedding is produced from each clip. These clip-level embeddings are either all passed to the downstream task or averaged before being passed to the task head.
  • Figure 5: Visualization of temporal action segmentation. The figure shows qualitative results for two test samples from GTEA. Each action class corresponds to a different color, and the x-axis represents time.
  • ...and 4 more figures
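
The Multi-Clip Embedding step described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `toy_encoder` is a hypothetical stand-in for a video foundation model, clips are assumed to be non-overlapping, and leftover frames that do not fill a clip are simply dropped.

```python
import numpy as np


def split_into_clips(frames: np.ndarray, clip_len: int) -> list:
    """Split a (T, H, W, C) video into non-overlapping clips of clip_len frames.

    Assumption: trailing frames that do not fill a whole clip are dropped.
    """
    num_clips = frames.shape[0] // clip_len
    return [frames[i * clip_len:(i + 1) * clip_len] for i in range(num_clips)]


def toy_encoder(clip: np.ndarray) -> np.ndarray:
    """Hypothetical encoder: averages over time and space, one value per channel."""
    return clip.mean(axis=(0, 1, 2))


def multi_clip_embedding(frames, clip_len, encoder, average):
    """Embed each clip, then either keep all clip embeddings or average them."""
    clips = split_into_clips(frames, clip_len)
    if not clips:
        raise ValueError("video is shorter than one clip")
    clip_embeddings = np.stack([encoder(clip) for clip in clips])  # (num_clips, D)
    if average:
        return clip_embeddings.mean(axis=0)  # one video-level vector, shape (D,)
    return clip_embeddings  # per-clip sequence for the downstream task


# Random stand-in for a 20-frame, 8x8 RGB video.
video = np.random.default_rng(0).random((20, 8, 8, 3))
per_clip = multi_clip_embedding(video, clip_len=8, encoder=toy_encoder, average=False)
pooled = multi_clip_embedding(video, clip_len=8, encoder=toy_encoder, average=True)
print(per_clip.shape, pooled.shape)
```

In this sketch the averaged vector would suit a video-level task head, while the unaveraged sequence keeps the clip order available to tasks that need it.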