A Large-Scale Analysis on Contextual Self-Supervised Video Representation Learning
Akash Kumar, Ashlesha Kumar, Vibhav Vineet, Yogesh S Rawat
TL;DR
This paper introduces a standardized benchmark for self-supervised video representation learning and conducts a comprehensive, large-scale analysis across dataset size, task complexity, distribution shifts, noise robustness, and feature representations. By evaluating six self-supervised learning (SSL) methods over six architectures on five datasets and two downstream tasks, it uncovers how pretext tasks and model capacity interact with data properties. Key findings show that spatio-temporal pretext tasks generalize better under distribution shifts, contrastive methods converge faster but are more sensitive to noise, and knowledge distillation can yield high performance with reduced data and training time. The work extends these insights to Video Foundation Models and demonstrates practical gains, including state-of-the-art action recognition with substantially less pretraining data, offering actionable guidelines for designing robust, scalable SSL pipelines in video understanding.
Abstract
Self-supervised learning has emerged as a powerful paradigm for label-free model pretraining, particularly in the video domain, where manual annotation is costly and time-intensive. However, existing self-supervised approaches employ diverse experimental setups, making direct comparisons challenging due to the absence of a standardized benchmark. In this work, we establish a unified benchmark that enables fair comparisons across different methods. Additionally, we systematically investigate five critical aspects of self-supervised learning in videos: (1) dataset size, (2) model complexity, (3) data distribution, (4) data noise, and (5) feature representations. To facilitate this study, we evaluate six self-supervised learning methods across six network architectures, conducting extensive experiments on five benchmark datasets and assessing performance on two distinct downstream tasks. Our analysis reveals key insights into the interplay between pretraining strategies, dataset characteristics, pretext tasks, and model architectures. Furthermore, we extend these findings to Video Foundation Models (ViFMs), demonstrating their relevance in large-scale video representation learning. Finally, leveraging these insights, we propose a novel approach that significantly reduces training data requirements while surpassing state-of-the-art methods that rely on 10% more pretraining data. We believe this work will guide future research toward a deeper understanding of self-supervised video representation learning and its broader implications.
