
From Static to Dynamic: Exploring Self-supervised Image-to-Video Representation Transfer Learning

Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai, Xilin Zhao, Qingming Huang

Abstract

Recent studies have made notable progress in video representation learning by transferring image-pretrained models to video tasks, typically with complex temporal modules and video fine-tuning. However, fine-tuning heavy modules may compromise inter-video semantic separability, i.e., the essential ability to distinguish objects across videos, while reducing the tunable parameters hinders intra-video temporal consistency, i.e., the stability of representations of the same object within a video. This dilemma indicates a potential trade-off between intra-video temporal consistency and inter-video semantic separability during image-to-video transfer. To this end, we propose the Consistency-Separability Trade-off Transfer Learning (Co-Settle) framework, which applies a lightweight projection layer on top of the frozen image-pretrained encoder to adjust the representation space with a temporal cycle consistency objective and a semantic separability constraint. We further provide theoretical support showing that the optimized projection yields a better trade-off between the two properties under appropriate conditions. Experiments on eight image-pretrained models demonstrate consistent improvements across multiple levels of video tasks with only five epochs of self-supervised training. The code is available at https://github.com/yafeng19/Co-Settle.
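The core mechanism described above can be illustrated with a minimal sketch: patch features from a frozen encoder are passed through a learnable projection, and temporal cycle consistency is measured by matching each patch of one frame to its nearest neighbor in the next frame and back. Everything here (shapes, the identity-initialized projection `W`, the noise model) is an assumed toy setup for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: N patch features of dimension d from a frozen encoder.
N, d = 16, 32
feat_a = rng.normal(size=(N, d))                  # patch features of frame t
feat_b = feat_a + 0.1 * rng.normal(size=(N, d))   # frame t+1, mildly perturbed

# Learnable projection layer; identity-initialized here. In the paper it is
# trained with the cycle-consistency objective -- this sketch only evaluates it.
W = np.eye(d)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cycle_accuracy(a, b, W):
    """Match a->b by cosine similarity, match b->a back, and count how many
    patches return to their starting index (a hard nearest-neighbor cycle)."""
    pa, pb = l2_normalize(a @ W), l2_normalize(b @ W)
    fwd = np.argmax(pa @ pb.T, axis=1)   # each patch in a -> NN in b
    bwd = np.argmax(pb @ pa.T, axis=1)   # each patch in b -> NN in a
    return float(np.mean(bwd[fwd] == np.arange(len(a))))

acc = cycle_accuracy(feat_a, feat_b, W)
print(acc)
```

With small temporal perturbations, most patches complete the cycle; training the projection would aim to keep this property while also separating patches from different videos.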


Paper Structure

This paper contains 52 sections, 5 theorems, 64 equations, 10 figures, 15 tables, and 1 algorithm.

Key Result

Theorem 1

Let the eigenvalues of the intra-video covariance matrix $\bm{\Sigma}$ be $\{\sigma_i\}_{i=1}^d$. For case i), let $\{\mu_i\}_{i=1}^d$ be the eigenvalues of $\bm{W}$. Assume $\bm{W}$ and $\bm{\Sigma}$ are positive semi-definite. For case ii), let $\{\mu_{1,i}\}_{i=1}^d$ and $\{\mu_{2,i}\}_{i=1}^d$ be …
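The quantities the theorem reasons about can be made concrete numerically: an intra-video covariance matrix built from the frame features of one video, and its eigenvalues $\{\sigma_i\}$. The shapes and the anisotropic scaling below are assumed for illustration only; this checks the positive semi-definiteness the theorem assumes, it does not reproduce the proof.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy setup: T frame features of dimension d from one video, with
# anisotropic variance so the eigenvalue spectrum is non-trivial.
T, d = 50, 4
frames = rng.normal(size=(T, d)) * np.array([3.0, 2.0, 1.0, 0.5])

# Intra-video covariance matrix Sigma (centered second moment over frames).
mean = frames.mean(axis=0)
Sigma = (frames - mean).T @ (frames - mean) / T

# Its eigenvalues {sigma_i}, returned in ascending order by eigvalsh.
sigmas = np.linalg.eigvalsh(Sigma)
assert np.all(sigmas >= -1e-9)   # Sigma is PSD, matching the theorem's assumption
print(np.round(sigmas, 2))
```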

Figures (10)

  • Figure 1: Comparison of video representation quality with recent visual representation learning models on the Kinetics-400 validation set. Favorable video representations should exhibit strong intra-video temporal consistency (lower intra-video distance $D_{intra}$) and clear inter-video semantic separability (higher inter-video distance $D_{inter}$) jointly, yet the two objectives often compete since the two distances co-vary. Applying our method to image-pretrained models leads to consistent improvements on the margin of inter- and intra-video distance $D=D_{inter}-\gamma D_{intra}$ (detailed in the distance metrics subsection), indicating a better trade-off between the two properties, and therefore leading to improved performance on video downstream tasks.
  • Figure 2: Overview of our image-to-video transfer learning framework. Two frames are sampled from each video to construct a cyclic sequence. A frozen image-pretrained encoder extracts patch-level features, which are then mapped by a learnable projection layer. The projection layer is trained with a temporal cycle-consistency loss and a semantic separability constraint for representation adjustment, thereby promoting a better trade-off between intra-video temporal consistency and inter-video semantic separability.
  • Figure 3: Left: Observations on shortcuts. Patches with the same color box denote correspondence. Middle: Overview of our PEA strategy. Right: Cycle-consistent accuracy and downstream performance dynamics during training with or without our PEA strategy on MAE encoder.
  • Figure 4: Evaluation results on frame-level and video-level tasks based on four representative image models.
  • Figure 5: Comparison of several categories of video representation learning methods with ours.
  • ...and 5 more figures
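The margin metric from Figure 1, $D = D_{inter} - \gamma D_{intra}$, can be sketched directly: average frame-to-center distance within each video versus average pairwise distance between video centers. The data model, the mean-pooled frame representations, and $\gamma = 1$ are assumptions of this sketch, not the paper's exact evaluation protocol.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed toy setup: V videos, T frames each, d-dim frame representations,
# generated as per-video centers plus mild within-video variation.
V, T, d = 8, 6, 16
centers = rng.normal(scale=2.0, size=(V, 1, d))
reps = centers + rng.normal(scale=0.3, size=(V, T, d))

video_means = reps.mean(axis=1)   # (V, d) video-level representations

# Intra-video distance: mean frame-to-center distance within each video.
D_intra = np.linalg.norm(reps - video_means[:, None], axis=-1).mean()

# Inter-video distance: mean pairwise distance between video centers.
diffs = video_means[:, None] - video_means[None, :]
pair = np.linalg.norm(diffs, axis=-1)
D_inter = pair[np.triu_indices(V, k=1)].mean()

gamma = 1.0                       # trade-off weight (assumed value)
D = D_inter - gamma * D_intra
print(round(float(D_intra), 2), round(float(D_inter), 2), round(float(D), 2))
```

A representation with tight within-video clusters and well-spread video centers yields a large margin $D$; collapsing either property shrinks it, which is the trade-off the paper targets.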

Theorems & Definitions (15)

  • Theorem 1: Spectral Properties of Optimal Projections, Informal
  • Remark 1
  • Theorem 2: Trade-off Improvement, Informal
  • Remark 2
  • Definition 1: Intra-video Covariance Matrix
  • Definition 2: Inter-video Covariance Matrix
  • Theorem 3: Spectral Properties of Optimal Projections, Formal
  • Proof
  • Lemma 1: Trade-off between temporal consistency and semantic separability
  • Proof
  • ...and 5 more