Self-Supervised Video Representation Learning in a Heuristic Decoupled Perspective

Zeen Song, Wenwen Qiang, Changwen Zheng, Hui Xiong, Gang Hua

TL;DR

The paper addresses the gap in video self-supervised learning where current V-CL methods fail to jointly capture static (appearance/background) and dynamic (motion) semantics due to dataset confounding. It introduces BOD-VCL, a Bi-level Optimization with Decoupling framework that uses a learned Koopman operator $\hat{\mathcal{K}}_\theta$ to decompose video representations into time-invariant and time-variant components, enabling separate optimization of static and dynamic similarities via decoupled losses $\mathcal{L}_A$ and $\mathcal{L}_M$. Theoretical analysis based on SCM and gradient dynamics shows that unified losses tend to collapse onto the easier semantics, and decoupled learning prevents this collapse, yielding richer representations. Empirically, BOD-VCL improves performance across action recognition, detection, and motion-centric datasets, with ablations verifying the effectiveness of bi-level optimization, Koopman-based decomposition, and decoupled losses. This approach offers a principled path to robust, semantics-complete video representations with practical gains on a wide range of downstream tasks.

Abstract

Video contrastive learning (V-CL) has emerged as a popular framework for unsupervised video representation learning, demonstrating strong results in tasks such as action classification and detection. Yet, to harness these benefits, it is critical for the learned representations to fully capture both static and dynamic semantics. However, our experiments show that existing V-CL methods fail to effectively learn either type of feature. Through a rigorous theoretical analysis based on the Structural Causal Model and gradient update, we find that in a given dataset, certain static semantics consistently co-occur with specific dynamic semantics. This phenomenon creates spurious correlations between static and dynamic semantics in the dataset. However, existing V-CL methods do not differentiate static and dynamic similarities when computing sample similarity. As a result, learning only one type of semantics is sufficient for the model to minimize the contrastive loss. Ultimately, this causes the V-CL pre-training process to prioritize learning the easier-to-learn semantics. To address this limitation, we propose Bi-level Optimization with Decoupling for Video Contrastive Learning (BOD-VCL). In BOD-VCL, we model videos as linear dynamical systems based on Koopman theory. In this system, all frame-to-frame transitions are represented by a linear Koopman operator. By performing eigen-decomposition on this operator, we can separate time-variant and time-invariant components of semantics, which allows us to explicitly separate the static and dynamic semantics in the video. By modeling static and dynamic similarity separately, both types of semantics can be fully exploited during the V-CL training process. BOD-VCL can be seamlessly integrated into existing V-CL frameworks, and experimental results highlight the significant improvements achieved by our method.
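
As a hedged illustration of why eigen-decomposition can separate the two kinds of semantics, the linear model described in the abstract can be written out as follows. The notation ($f_\theta$ for the frame encoder, $\hat{\mathcal{K}}_\theta$ for the estimated operator, $\lambda_m$ and $\varphi_m$ for its eigenvalues and eigenvectors) follows the paper's figure captions; treating $\varphi_m$ as a left eigenvector is an assumption made here for the derivation, not something the text above states.

$$
f_\theta(x_{t+1}) \approx \hat{\mathcal{K}}_\theta\, f_\theta(x_t),
\qquad
\varphi_m^\top \hat{\mathcal{K}}_\theta = \lambda_m\, \varphi_m^\top
$$

$$
\varphi_m^\top f_\theta(x_{t+k})
\approx \varphi_m^\top \hat{\mathcal{K}}_\theta^{\,k} f_\theta(x_t)
= \lambda_m^{k}\, \varphi_m^\top f_\theta(x_t)
$$

Under these assumptions, a mode with $\lambda_m \approx 1$ has an amplitude that stays nearly constant across frames (time-invariant, hence static), while a mode with $\lambda_m$ away from $1$ decays, grows, or oscillates with $k$ (time-variant, hence dynamic).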

Paper Structure

This paper contains 49 sections, 5 theorems, 51 equations, 12 figures, 14 tables, 1 algorithm.

Key Result

Proposition: Incomplete Semantic Learning with Unified Loss

Assume that for each $j \in \{1, \dots, k\}$, the pair $(a^{(j)}, m^{(j)})$ is confoundedly correlated, meaning that $a^{(j)}$ and $m^{(j)}$ provide redundant information about the similarity structure $S$. Suppose further that, during training (i.e., along the gradient descent iterations), for each [...] depending on which component is harder to learn. Consequently, rather than jointly encoding both $a^{(j)}$ [...]

Figures (12)

  • Figure 1: (a) Visualization of static and dynamic semantics from a video capturing a tennis match. (b) Illustration of the motivation experiments on the Static UCF-101 and Dynamic UCF-101 datasets.
  • Figure 2: A graphical illustration of the proposed SCM. In this SCM, $\mathbf{X}$ represents the video samples, $\mathbf{A}$ represents static semantics, $\mathbf{M}$ represents dynamic semantics, and $\mathbf{S}$ represents the similarity score.
  • Figure 3: Overview of the proposed BOD-VCL framework. (a) Estimation of $\hat{\mathcal{K}}_\theta$. The prediction loss $\mathcal{L}_{pred}(\theta)$ ensures that the estimated $\hat{\mathcal{K}}_\theta$ precisely describes inter-frame evolution. $X_{back}$ denotes the first $T-1$ frames, while $X_{fore}$ denotes the subsequent $T-1$ frames of video $X$. (b) Eigen-decomposition is performed on $\hat{\mathcal{K}}_\theta$ to separate eigenvalues into time-invariant and time-variant subsets, corresponding to static and dynamic semantics, respectively. The associated eigenvectors are then used to extract static and dynamic features independently. (c) The separated static and dynamic semantic representations are used to independently optimize contrastive objectives, ensuring effective learning of both components. (d) The overall optimization is formulated as a bi-level process. In the first stage, the prediction loss $\mathcal{L}_{pred}$ is minimized to obtain a reliable Koopman operator $\hat{\mathcal{K}}_\theta$. In the second stage, given this operator, the decoupled contrastive losses $\mathcal{L}_A$ and $\mathcal{L}_M$ are optimized to obtain better static and dynamic representations.
  • Figure 4: (a) Illustration of different eigenvalues $\lambda_m$ in the complex plane. Eigenvectors whose eigenvalues lie near $(1 + 0j)$ are related to the time-invariant semantics, while the others are related to time-variant semantics. (b) Temporal evolution of the feature amplitudes $\varphi_m^\top f_\theta(x_{t+k})$ corresponding to different $\lambda_m$. Specifically, when $\lambda_m\approx1$, $\varphi_m^\top f_\theta(x_{t+k})$ remains nearly constant for different values of $k$.
  • Figure 5: Examples from the constructed synthetic dataset. (a) and (e) represent simple static semantics defined by background color (blue and green). (b) and (f) show complex static semantics defined by foreground shape (pentagon and circle). (c) and (g) illustrate simple dynamic semantics defined by background brightness change (brightening and darkening). (d) and (h) correspond to complex dynamic semantics defined by foreground motion trajectory (rotation and linear motion).
  • ...and 7 more figures
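
The pipeline outlined in the captions of Figures 3 and 4 (fit an operator on consecutive frame features, eigen-decompose it, then split modes by how close their eigenvalue is to $1$) can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the paper learns $\hat{\mathcal{K}}_\theta$ by minimizing $\mathcal{L}_{pred}$ during training, whereas this sketch swaps in a closed-form least-squares fit. The function names, the threshold `tol`, the left-eigenvector convention, and the synthetic 3-dimensional features are all hypothetical choices made for the example.

```python
import numpy as np


def estimate_koopman(features):
    """Least-squares fit of K such that features[t + 1] ≈ K @ features[t].

    Assumption: stands in for the learned operator; the paper instead
    minimizes a prediction loss by gradient-based training.
    """
    z_back, z_fore = features[:-1], features[1:]  # first / subsequent T-1 frames
    # Solve z_back @ K.T ≈ z_fore in the least-squares sense.
    k_transposed, *_ = np.linalg.lstsq(z_back, z_fore, rcond=None)
    return k_transposed.T


def split_modes(koopman, tol=0.05):
    """Eigen-decompose K and flag modes whose eigenvalue lies near 1 + 0j."""
    # Eigenvectors of K.T are left eigenvectors of K: phi^T K = lambda phi^T.
    eigvals, phi = np.linalg.eig(koopman.T)
    static_mask = np.abs(eigvals - 1.0) < tol  # hypothetical threshold
    return eigvals, phi, static_mask


def make_toy_features(num_frames=20, theta=0.5, decay=0.9):
    """Synthetic features: one constant mode plus one decaying rotation."""
    rng = np.random.default_rng(0)
    block = np.array([
        [1.0, 0.0, 0.0],
        [0.0, decay * np.cos(theta), -decay * np.sin(theta)],
        [0.0, decay * np.sin(theta), decay * np.cos(theta)],
    ])
    # Orthogonal change of basis so the modes are mixed in feature space.
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    k_true = q @ block @ q.T
    frames = [q @ np.ones(3)]
    for _ in range(num_frames - 1):
        frames.append(k_true @ frames[-1])
    return np.stack(frames), k_true


features, k_true = make_toy_features()
k_hat = estimate_koopman(features)
eigvals, phi, static_mask = split_modes(k_hat)

# amplitudes[t, m] = phi_m^T f(x_t), the quantity plotted in Figure 4(b).
amplitudes = features @ phi
static_feats = amplitudes[:, static_mask]
dynamic_feats = amplitudes[:, ~static_mask]
```

On this toy system the static amplitude stays constant across frames while the dynamic amplitudes shrink as the rotation decays. In a real setting the features would come from a trained encoder, and the threshold separating "near $1$" from the rest would need to be chosen with care.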

Theorems & Definitions (13)

  • Definition
  • Proposition: Incomplete Semantic Learning with Unified Loss
  • Corollary: Complete Representation with Decoupled Losses
  • Definition
  • Definition
  • Definition
  • Proposition: Incomplete Semantic Learning with Unified Loss
  • Proof
  • Corollary: Complete Representation with Decoupled Losses
  • Proof
  • ...and 3 more