Advancing Video Self-Supervised Learning via Image Foundation Models

Jingwei Wu, Zhewei Huang, Chang Liu

TL;DR

This work tackles the high computational cost of video self-supervised learning by proposing AdViSe, a framework that freezes Image Foundation Models (IFMs) and trains a lightweight Temporal Modeling Module (TMM) on top to capture temporal dynamics. A Playback Rate Perception (PRP) pretext task guides temporal aggregation while preserving the IFM, with a Spatial Feature Utilization (SFU) stage compressing spatial features prior to temporal fusion. Empirical results on benchmarks like UCF101, HMDB51, Diving48, and SSv2 show AdViSe achieving competitive accuracy while delivering up to 3.4× faster training and 8.2× lower GPU memory usage, illustrating substantial efficiency gains. The study also provides actionable design guidelines for SFU and TMM configurations and demonstrates that the approach scales with stronger IFMs, highlighting its practical impact for cost-efficient video representation learning.

Abstract

In the past decade, image foundation models (IFMs) have achieved unprecedented progress. However, the potential of directly using IFMs for video self-supervised representation learning has largely been overlooked. In this study, we propose an advancing video self-supervised learning (AdViSe) approach, aimed at significantly reducing the training overhead of video representation models using pre-trained IFMs. Specifically, we first introduce temporal modeling modules (ResNet3D) to IFMs, constructing a video representation model. We then employ a video self-supervised learning approach, playback rate perception, to train temporal modules while freezing the IFM components. Experiments on UCF101 demonstrate that AdViSe achieves performance comparable to state-of-the-art methods while reducing training time by $3.4\times$ and GPU memory usage by $8.2\times$. This study offers fresh insights into low-cost video self-supervised learning based on pre-trained IFMs. Code is available at https://github.com/JingwWu/advise-video-ssl.
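The playback rate perception pretext task mentioned above can be illustrated with a minimal sketch of how one training sample might be built: a clip is drawn from a video at a randomly chosen frame stride, and the index of that stride serves as the classification label. This covers only the rate-classification aspect of the task. The candidate rates, the clip length, and the name `sample_prp_clip` are assumptions made for illustration and are not taken from the paper or its code release.

```python
import random

# Hypothetical candidate playback rates (frame strides); the paper's actual set may differ.
PLAYBACK_RATES = (1, 2, 4, 8)


def sample_prp_clip(num_frames, clip_len=16, rates=PLAYBACK_RATES, rng=random):
    """Return (frame_indices, label) for one playback-rate pretext sample.

    frame_indices: clip_len indices into the video, spaced by the chosen rate.
    label: position of the chosen rate in `rates` (the classification target).
    """
    # Keep only the rates whose clip fits entirely inside the video.
    feasible = [(i, r) for i, r in enumerate(rates) if (clip_len - 1) * r < num_frames]
    if not feasible:
        raise ValueError("video too short for any candidate playback rate")

    label, rate = rng.choice(feasible)
    span = (clip_len - 1) * rate                   # distance from first to last sampled frame
    start = rng.randint(0, num_frames - 1 - span)  # randint is inclusive on both ends
    indices = [start + k * rate for k in range(clip_len)]
    return indices, label


if __name__ == "__main__":
    indices, label = sample_prp_clip(num_frames=300)
    print("stride:", PLAYBACK_RATES[label], "frames:", indices)
```

In a setup like the one the abstract describes, the frames selected this way would be encoded by the frozen IFM, and only the temporal modules would receive gradients from the rate-prediction loss.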

Paper Structure

This paper contains 31 sections, 7 figures, 10 tables.

Figures (7)

  • Figure 1: AdViSe utilizes IFMs to implement efficient video self-supervised learning. (a) With a video SSL method yao2020video, AdViSe significantly improves performance over R3D-50 feichtenhofer2021large at much lower training cost. (b1, b2) As IFMs radford2021learning enhance their spatial feature encoding capabilities (as indicated by ImageNet deng2009imagenet Zero-Shot Accuracy), the performance of AdViSe on the downstream task (action recognition) also improves.
  • Figure 2: AdViSe paradigm. It leverages the spatial features produced by the image foundation model (IFM) to train the temporal modeling module (TMM), aggregating temporal information and effectively encoding motion dynamics.
  • Figure 3: Impacts of spatial resolution compression. Spatial resolution compression significantly reduces the $Acc_{ft}$ metric (bottom) on Diving48 li2018resound but has a lesser impact on UCF101 soomro2012ucf101, while the ${\Delta}Acc$ metric (top) declines on both datasets.
  • Figure 4: Impacts of channel dimension compression. Channel dimension compression reduces $Acc_{ft}$ (bottom) on both UCF101 soomro2012ucf101 and Diving48 li2018resound. For the ${\Delta}Acc$ metric (top), only Diving48 is affected to some degree.
  • Figure 5: Impacts of TMM blocks. Increasing the number of TMM blocks improves fine-tuning accuracy ($Acc_{ft}$) on Diving48 li2018resound but reduces it on UCF101 soomro2012ucf101.
  • ...and 2 more figures