SEVERE++: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning
Fida Mohammad Thoker, Letian Jiang, Chen Zhao, Piyush Bagad, Hazel Doughty, Bernard Ghanem, Cees G. M. Snoek
TL;DR
SEVERE++ provides a comprehensive, benchmark-driven assessment of generalization in video representation learning by evaluating CNNs, video-only transformers, and video-text transformers across four downstream factors: domain shift, sample efficiency, action granularity, and task diversity. Extending prior SEVERE work to transformers, the study analyzes over 1100 experiments across 8 datasets and 7 tasks, revealing that no method consistently generalizes across all factors and that transformer gains are not universal. Key findings include stronger domain-shift robustness for video-only transformers, superior fine-grained action performance for CNNs, and underwhelming performance of large-scale video-text pretraining in several settings, despite substantial data. The SEVERE-benchmark++ framework and its subset recommendations offer a practical, unified protocol to benchmark future video SSL methods for robust transfer and real-world applicability. Together, these results highlight the need for diversified benchmarks and targeted pretraining strategies (e.g., motion-aware masking, high-level semantic reconstruction) to improve generalization in video representation learning.
Abstract
Continued advances in self-supervised learning have led to significant progress in video representation learning, offering a scalable alternative to supervised approaches by removing the need for manual annotations. Despite strong performance on standard action recognition benchmarks, video self-supervised learning methods are largely evaluated under narrow protocols, typically pretraining on Kinetics-400 and fine-tuning on similar datasets, limiting our understanding of their generalization in real-world scenarios. In this work, we present a comprehensive evaluation of modern video self-supervised models, focusing on generalization across four key downstream factors: domain shift, sample efficiency, action granularity, and task diversity. Building on our prior work analyzing benchmark sensitivity in CNN-based contrastive learning, we extend the study to cover state-of-the-art transformer-based video-only and video-text models. Specifically, we benchmark 12 transformer-based methods (7 video-only, 5 video-text) and compare them to 10 CNN-based methods, totaling over 1100 experiments across 8 datasets and 7 downstream tasks. Our analysis shows that, despite architectural advances, transformer-based models remain sensitive to downstream conditions. No method generalizes consistently across all factors: video-only transformers perform better under domain shifts, CNNs perform better on fine-grained tasks, and video-text models often underperform despite large-scale pretraining. We also find that recent transformer models do not consistently outperform earlier approaches. Our findings provide a detailed view of the strengths and limitations of current video SSL methods and offer a unified benchmark for evaluating generalization in video representation learning.
