A Large-Scale Analysis on Contextual Self-Supervised Video Representation Learning

Akash Kumar, Ashlesha Kumar, Vibhav Vineet, Yogesh S Rawat

TL;DR

This paper introduces a standardized benchmark for self-supervised video representation learning and conducts a comprehensive, large-scale analysis across dataset size, task complexity, distribution shifts, noise robustness, and feature representations. By evaluating six SSL methods over six architectures on five datasets and two downstream tasks, it uncovers how pretext tasks and model capacity interact with data properties. Key findings show that spatio-temporal pretext tasks generalize better under distribution shifts, contrastive methods converge faster but are more sensitive to noise, and knowledge distillation can yield high performance with reduced data and training time. The work extends insights to Video Foundation Models and demonstrates practical gains, including state-of-the-art action recognition with substantially less pretraining data, offering actionable guidelines for designing robust, scalable SSL pipelines in video understanding.

Abstract

Self-supervised learning has emerged as a powerful paradigm for label-free model pretraining, particularly in the video domain, where manual annotation is costly and time-intensive. However, existing self-supervised approaches employ diverse experimental setups, making direct comparisons challenging due to the absence of a standardized benchmark. In this work, we establish a unified benchmark that enables fair comparisons across different methods. Additionally, we systematically investigate five critical aspects of self-supervised learning in videos: (1) dataset size, (2) model complexity, (3) data distribution, (4) data noise, and (5) feature representations. To facilitate this study, we evaluate six self-supervised learning methods across six network architectures, conducting extensive experiments on five benchmark datasets and assessing performance on two distinct downstream tasks. Our analysis reveals key insights into the interplay between pretraining strategies, dataset characteristics, pretext tasks, and model architectures. Furthermore, we extend these findings to Video Foundation Models (ViFMs), demonstrating their relevance in large-scale video representation learning. Finally, leveraging these insights, we propose a novel approach that significantly reduces training data requirements while surpassing state-of-the-art methods that rely on 10% more pretraining data. We believe this work will guide future research toward a deeper understanding of self-supervised video representation learning and its broader implications.

Paper Structure

This paper contains 50 sections, 14 figures, and 17 tables.

Figures (14)

  • Figure 1: Overview of the proposed benchmark, which studies five aspects. From left to right: 1) Effect of dataset size versus training time: as the dataset size increases, the variation in performance decreases even with longer training time. 2) Effect of task complexity (C1, C2, C3 denote different complexities): the bottom panel shows how complexity increases for the RotNet task, and the top panel shows how performance varies for the R21D network. 3) Data distribution shift: the impact of the target data distribution on the source data. 4) Distribution shift introduced by noise: non-contrastive tasks are more robust than contrastive ones even as the severity of the noise increases. The bottom panel shows an example of each type of noise; clips are provided in the supplementary material. 5) Whether the features learn orthogonal information: using different architectures as teachers can substantially improve performance even in a low-data regime.
  • Figure 2: Left: Performance on dataset subsets for three different architectures on the RSPNet pretext task (x-axis: subset size, y-axis: Top-1 accuracy). Here, 10 denotes the 10k subset, 30 the 30k subset, and so on. Right: CKA maps for RSPNet on different subsets with the R21D backbone.
  • Figure 3: Effect of different dataset distributions. Here, S, T, and ST denote spatial (CVRL), temporal (VCOP), and spatio-temporal (RSPNet), respectively. The x-axis shows the source dataset and the y-axis shows Top-1 accuracy.
  • Figure 4: Feature analysis overview. This figure shows how KD is beneficial as a tool across multiple scenarios. Brief details for each setup (left to right): (A) Effect of dataset size: the teachers (T1 and T2) are different architectures for a single subset. The CKA maps of the student model (ST-Shuffle) show that it learns complementary information, especially for 30k. (B) Task complexity: the teachers are models trained at multiple complexities of the same task (C1, C2, C3 denote the different complexities used as teachers). In most scenarios, the student (ST) network outperforms all teacher models, which suggests that it learns orthogonal information from multiple teachers. (C) Out-of-distribution: models from different source datasets are the teachers. The student model (ST) outperforms both teachers, which were trained on two different datasets. (D) Pretext tasks: spatial and temporal task networks are the teachers. The student model (ST), which learns from these two categories of pretext tasks, incorporates knowledge from both and outperforms both teachers for both contrastive and non-contrastive tasks.
  • Figure 5: Knowledge distillation using teachers trained on multiple subset sizes on RSPNet. Student: ShuffleNet, on a) UCF101 and b) HMDB51. Here, T1 is Teacher-1 (ShuffleNet) and T2 is Teacher-2 (R21D). Top@5 clip retrieval with R21D on c) UCF101 and d) HMDB51, pre-trained on K400 and SSv2 (30k subset).
  • ...and 9 more figures
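
The captions for Figures 2 and 4 refer to CKA (centered kernel alignment) maps, which compare the representations of two networks or layers on the same inputs. The captions do not say which CKA variant the paper uses, so the following is only a minimal sketch assuming the common linear form, CKA(X, Y) = ||Yᵀ X||²_F / (||Xᵀ X||_F · ||Yᵀ Y||_F) on column-centered activations. The function names (`center_columns`, `cross_gram_sq_norm`, `linear_cka`) are hypothetical and written in plain Python for self-containment; a CKA map would evaluate `linear_cka` for every pair of layers.

```python
import math


def center_columns(features):
    """Subtract each column's mean so every feature dimension has zero mean."""
    n = len(features)
    dims = len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(dims)]
    return [[row[j] - means[j] for j in range(dims)] for row in features]


def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b; a and b must have the same number of rows."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            dot = sum(a[k][i] * b[k][j] for k in range(len(a)))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two activation matrices shaped (n_examples, n_features).

    Assumption: linear kernel. Raises ZeroDivisionError if either matrix
    has no variance across examples.
    """
    if len(x) != len(y):
        raise ValueError("both matrices need one row per example")
    xc, yc = center_columns(x), center_columns(y)
    cross = cross_gram_sq_norm(xc, yc)
    norm_x = math.sqrt(cross_gram_sq_norm(xc, xc))
    norm_y = math.sqrt(cross_gram_sq_norm(yc, yc))
    return cross / (norm_x * norm_y)
```

Under this definition, a value near 1 means the two sets of features carry the same information up to rotation and uniform scaling, while a value near 0 means they are linearly unrelated, which is one way to read "complementary" or "orthogonal" information in the captions above.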