SEVERE++: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning

Fida Mohammad Thoker, Letian Jiang, Chen Zhao, Piyush Bagad, Hazel Doughty, Bernard Ghanem, Cees G. M. Snoek

TL;DR

SEVERE++ provides a comprehensive, benchmark-driven assessment of generalization in video representation learning by evaluating CNNs, video-only transformers, and video-text transformers across four downstream factors: domain shift, sample efficiency, action granularity, and task diversity. Extending prior SEVERE work to transformers, the study analyzes over 1100 experiments across 8 datasets and 7 tasks, revealing that no method consistently generalizes across all factors and that transformer gains are not universal. Key findings include stronger domain-shift robustness for video-only transformers, superior fine-grained action performance for CNNs, and underwhelming performance of large-scale video-text pretraining in several settings, despite substantial data. The SEVERE-benchmark++ framework and its subset recommendations offer a practical, unified protocol to benchmark future video SSL methods for robust transfer and real-world applicability. Together, these results highlight the need for diversified benchmarks and targeted pretraining strategies (e.g., motion-aware masking, high-level semantic reconstruction) to improve generalization in video representation learning.

Abstract

Continued advances in self-supervised learning have led to significant progress in video representation learning, offering a scalable alternative to supervised approaches by removing the need for manual annotations. Despite strong performance on standard action recognition benchmarks, video self-supervised learning methods are largely evaluated under narrow protocols, typically pretraining on Kinetics-400 and fine-tuning on similar datasets, which limits our understanding of their generalization in real-world scenarios. In this work, we present a comprehensive evaluation of modern video self-supervised models, focusing on generalization across four key downstream factors: domain shift, sample efficiency, action granularity, and task diversity. Building on our prior work analyzing benchmark sensitivity in CNN-based contrastive learning, we extend the study to cover state-of-the-art transformer-based video-only and video-text models. Specifically, we benchmark 12 transformer-based methods (7 video-only, 5 video-text) and compare them to 10 CNN-based methods, totaling over 1100 experiments across 8 datasets and 7 downstream tasks. Our analysis shows that, despite architectural advances, transformer-based models remain sensitive to downstream conditions. No method generalizes consistently across all factors: video-only transformers perform better under domain shifts, CNNs perform better on fine-grained tasks, and video-text models often underperform despite large-scale pretraining. We also find that recent transformer models do not consistently outperform earlier approaches. Our findings provide a detailed view of the strengths and limitations of current video SSL methods and offer a unified benchmark for evaluating generalization in video representation learning.

Paper Structure

This paper contains 32 sections, 1 equation, 4 figures, 18 tables.

Figures (4)

  • Figure 1: Benchmark sensitivity. We evaluate the sensitivity of 10 CNN-based video SSL methods, 7 transformer-based video-only SSL methods, and 5 transformer-based video-text pre-training methods to 4 downstream factors. The downstream factors vary from the pre-training source in: the domain, the samples, the actions and the task.
  • Figure 2: Video dataset characteristics. Characterizing domain shift in datasets via differences from Kinetics-400 (shown by the dotted line) in label overlap, point-of-view (PoV), environment, action length and temporal awareness. Kinetics-400 and UCF-101 are highly similar to each other, while datasets like Something-Something-v2, EPIC-Kitchens-100 and Charades have different attributes compared to Kinetics-400.
  • Figure 3: Example video frames from the Kinetics-400 pre-training dataset and some downstream datasets we consider. Note the differences in the capture settings and point-of-view across these datasets.
  • Figure 4: Temporal awareness. Illustrating the effect of temporal awareness (increasing temporal context) on action recognition performance, using a standard 3D-CNN, for different action datasets.