Self-Supervised Video Representation Learning in a Heuristic Decoupled Perspective
Zeen Song, Wenwen Qiang, Changwen Zheng, Hui Xiong, Gang Hua
TL;DR
The paper addresses a gap in video self-supervised learning: current V-CL methods fail to jointly capture static (appearance/background) and dynamic (motion) semantics because the two are confounded in the dataset. It introduces BOD-VCL, a Bi-level Optimization with Decoupling framework that uses a learned Koopman operator $\hat{\mathcal{K}}_\theta$ to decompose video representations into time-invariant and time-variant components, so that static and dynamic similarities can be optimized separately via the decoupled losses $\mathcal{L}_A$ and $\mathcal{L}_M$. Theoretical analysis based on a Structural Causal Model (SCM) and gradient dynamics shows that unified losses tend to collapse onto the easier semantics, and that decoupled learning prevents this collapse, yielding richer representations. Empirically, BOD-VCL improves performance on action recognition, detection, and motion-centric datasets, with ablations verifying the effectiveness of bi-level optimization, Koopman-based decomposition, and the decoupled losses. The approach offers a principled path to robust, semantically complete video representations with practical gains on a wide range of downstream tasks.
Abstract
Video contrastive learning (V-CL) has emerged as a popular framework for unsupervised video representation learning, demonstrating strong results in tasks such as action classification and detection. To harness these benefits, the learned representations must fully capture both static and dynamic semantics. However, our experiments show that existing V-CL methods fail to effectively learn either type of feature. Through a rigorous theoretical analysis based on the Structural Causal Model (SCM) and gradient updates, we find that in a given dataset, certain static semantics consistently co-occur with specific dynamic semantics, which creates spurious correlations between the two. Because existing V-CL methods do not differentiate static and dynamic similarities when computing sample similarity, learning only one type of semantics is sufficient for the model to minimize the contrastive loss. Ultimately, this causes the V-CL pre-training process to prioritize the easier-to-learn semantics. To address this limitation, we propose Bi-level Optimization with Decoupling for Video Contrastive Learning (BOD-VCL). In BOD-VCL, we model videos as linear dynamical systems based on Koopman theory, in which all frame-to-frame transitions are represented by a linear Koopman operator. By performing eigen-decomposition on this operator, we can separate the time-variant and time-invariant components of the representation, which allows us to explicitly separate the static and dynamic semantics in the video. By modeling static and dynamic similarity separately, both types of semantics can be fully exploited during the V-CL training process. BOD-VCL can be seamlessly integrated into existing V-CL frameworks, and experimental results highlight the significant improvements achieved by our method.
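To make the Koopman-based separation concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the operator is fit by ordinary least squares on a sequence of frame embeddings (a DMD-style stand-in for the learned operator $\hat{\mathcal{K}}_\theta$ and the bi-level optimization), and it assumes that eigenvalues within a hypothetical tolerance `tol` of 1 mark time-invariant modes. The function names `fit_koopman` and `split_static_dynamic` and the toy trajectory are illustrative only.

```python
import numpy as np


def fit_koopman(Z):
    """Least-squares linear operator K with Z[t+1] ~= K @ Z[t].

    Z has shape (T, d): one d-dimensional embedding per frame.
    Assumption: a plain least-squares fit replaces the learned operator.
    """
    X, Y = Z[:-1].T, Z[1:].T          # (d, T-1) each
    return Y @ np.linalg.pinv(X)      # (d, d)


def split_static_dynamic(Z, tol=1e-3):
    """Split embeddings into time-invariant and time-variant parts.

    Modes whose eigenvalue lies within `tol` of 1 are treated as static
    (hypothetical threshold); all remaining modes are treated as dynamic.
    """
    K = fit_koopman(Z)
    eigvals, V = np.linalg.eig(K)                      # K = V diag(eigvals) V^-1
    coords = np.linalg.solve(V, Z.T.astype(complex))   # eigenbasis coordinates, (d, T)
    static_mask = np.abs(eigvals - 1.0) < tol
    # Conjugate eigenvalue pairs fall on the same side of the mask,
    # so each projection is real up to round-off.
    static = (V[:, static_mask] @ coords[static_mask]).real.T
    dynamic = (V[:, ~static_mask] @ coords[~static_mask]).real.T
    return static, dynamic, eigvals


# Toy trajectory: one constant coordinate plus a 2-D rotation.
theta = 0.3
A = np.array([
    [1.0, 0.0, 0.0],
    [0.0, np.cos(theta), -np.sin(theta)],
    [0.0, np.sin(theta), np.cos(theta)],
])
z0 = np.array([1.0, 1.0, 0.0])
Z = np.stack([np.linalg.matrix_power(A, t) @ z0 for t in range(20)])

static, dynamic, eigvals = split_static_dynamic(Z)
```

On this toy sequence the static part is the same vector at every frame and the dynamic part carries the rotation, and the two parts sum back to the original embeddings. In a V-CL setting one could then compute similarities on the two parts separately, in the spirit of the decoupled losses; how BOD-VCL actually defines and optimizes those losses is not specified in this section.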
