Collaboratively Self-supervised Video Representation Learning for Action Recognition

Jie Zhang; Zhifan Wan; Lanqing Hu; Stephen Lin; Shuzhe Wu; Shiguang Shan

TL;DR

The paper tackles the data scarcity of action recognition by learning robust video representations via self-supervision. It introduces CSVR, a three-branch framework that jointly optimizes Generative Pose Prediction for dynamic motion ($f_d$), Discriminative Context Matching for static context ($f_s$ via MI-InfoNCE), and a Collaborative Video Generating Branch that fuses these signals into an integrated feature $f_i$ through AdaIN and leverages two TGANv2 networks for reconstruction and future prediction. The approach achieves state-of-the-art results on multiple datasets (e.g., UCF101, HMDB51, SSv2) and shows strong gains in both action recognition and video retrieval, with ablations confirming the contributions of each component. By demonstrating effective collaboration between discriminative and generative self-supervised signals, CSVR offers a scalable pathway to high-quality video representations and points to future enhancements with more powerful backbones or diffusion-based generation.

Abstract

Considering the close connection between action recognition and human pose estimation, we design a Collaboratively Self-supervised Video Representation (CSVR) learning framework specific to action recognition by jointly factoring in generative pose prediction and discriminative context matching as pretext tasks. Specifically, our CSVR consists of three branches: a generative pose prediction branch, a discriminative context matching branch, and a video generating branch. Among them, the first encodes dynamic motion features by using a conditional GAN to predict the human poses of future frames, and the second extracts static context features by contrasting positive and negative pairs of video features and I-frame features. The third branch generates both current and future video frames in order to collaboratively improve the dynamic motion features and the static context features. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple popular video datasets.
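
To make the context matching idea concrete, here is a minimal sketch of a contrastive loss over pairs of clip features and I-frame features. It uses the plain InfoNCE form, not the MI-InfoNCE variant named in the TL;DR, whose exact formulation is not given here. The function name `info_nce`, the temperature of 0.07, and the use of other clips in the batch as negatives are all assumptions made for illustration.

```python
import numpy as np

def info_nce(clip_feats, iframe_feats, temperature=0.07):
    """Plain InfoNCE over a batch (an illustrative stand-in, not MI-InfoNCE).

    Row i of clip_feats and row i of iframe_feats are assumed to come from
    the same video clip (positive pair); all other rows act as negatives.
    """
    # L2-normalize so that dot products are cosine similarities.
    v = clip_feats / np.linalg.norm(clip_feats, axis=1, keepdims=True)
    k = iframe_feats / np.linalg.norm(iframe_feats, axis=1, keepdims=True)
    logits = v @ k.T / temperature                        # shape (B, B)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal; average their negative log-likelihood.
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
clip = rng.normal(size=(8, 16))                           # toy clip features
matched = info_nce(clip, clip + 0.01 * rng.normal(size=(8, 16)))
shuffled = info_nce(clip, clip[::-1].copy())              # wrong pairing
print(matched, shuffled)
```

With correctly paired features the loss is close to zero, while a deliberately wrong pairing gives a much larger value, which is the signal that pulls features from the same clip together.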

Paper Structure (18 sections, 10 equations, 4 figures, 9 tables)

Introduction
Related Work
Video-Based Action Recognition
Self-Supervised Video Representation Learning
Video Generation
Methodology
Overall Framework
Generative Pose Prediction Branch
Discriminative Context Matching Branch
Collaborative Video Generating Branch
Joint Optimization
Experiments
Dataset
Implementation Details
Ablation Study
...and 3 more sections

Figures (4)

Figure 1: An overview of our self-supervised learning process. Our CSVR contains three branches: a pose prediction branch, a context matching branch, and a video generating branch. The pose prediction branch uses poses from input video clips to predict future poses. The dynamic motion feature $f_d$ originates from the encoder of the pose generator. The context matching branch captures video clip features and intra-frame (I-frame) features with CNNs, and then applies a contrastive loss on the learned features to pull feature pairs from the same video clip together. Through contrastive learning, the video clip feature $f_s$ comes to contain rich static context information. After the integration layer fuses these features, the integrated feature $f_i$ is fed into the video generating networks, which try to reconstruct the current video clip and predict the future video clip, respectively. The video generating branch jointly optimizes all the branches, leading to a more comprehensive video representation.
Figure 2: Detailed structure of the integration layer. To generate the integrated feature $f_i$, we use parameters generated from $f_d$ to denormalize the normalized feature $\bar{f_s}$, and combine the denormalized feature $D$ and the static context feature $f_s$ using an attention mask $M$.
Figure 3: Action recognition results on UCF101 with different combinations of $\sigma_p,\sigma_c$.
Figure 4: Per-class action recognition accuracy with different features. Labels on the horizontal axis are divided into three groups: classes in the orange group are highly related to motion information, classes in the blue group can be easily recognized by static appearance, and classes in the black group exhibit both fixed action patterns and rich scene changes. The figure reveals the contributions of the different features to action recognition.
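
The integration layer described in the Figure 2 caption can be sketched roughly as follows. This is only an illustration under several assumptions: features are treated as flat vectors rather than spatial maps, the scale and shift are produced from $f_d$ by plain linear maps, the mask $M$ is a sigmoid over a linear map of the concatenated $D$ and $f_s$, and the blend is $f_i = M \cdot D + (1 - M) \cdot f_s$. The names `integrate`, `W_gamma`, `W_beta`, and `W_mask` are hypothetical; the caption does not specify how the parameters or the mask are actually computed.

```python
import numpy as np

def integrate(f_s, f_d, W_gamma, W_beta, W_mask, eps=1e-5):
    """AdaIN-style fusion of a static feature f_s with a dynamic feature f_d.

    Shapes (assumed): f_s (B, C), f_d (B, Cd), W_gamma and W_beta (Cd, C),
    W_mask (2C, C). All weights are hypothetical stand-ins for learned layers.
    """
    # Normalize the static feature per sample.
    mean = f_s.mean(axis=1, keepdims=True)
    std = f_s.std(axis=1, keepdims=True)
    f_s_bar = (f_s - mean) / (std + eps)
    # Scale and shift are predicted from the dynamic feature (assumed linear).
    gamma = f_d @ W_gamma
    beta = f_d @ W_beta
    D = gamma * f_s_bar + beta                 # denormalized feature
    # Assumed attention mask: sigmoid over a linear map of [D, f_s].
    M = 1.0 / (1.0 + np.exp(-(np.concatenate([D, f_s], axis=1) @ W_mask)))
    f_i = M * D + (1.0 - M) * f_s              # assumed blending rule
    return f_i, M

rng = np.random.default_rng(0)
B, C, Cd = 4, 16, 8
f_s = rng.normal(size=(B, C))
f_d = rng.normal(size=(B, Cd))
W_gamma = 0.1 * rng.normal(size=(Cd, C))
W_beta = 0.1 * rng.normal(size=(Cd, C))
W_mask = 0.1 * rng.normal(size=(2 * C, C))
f_i, M = integrate(f_s, f_d, W_gamma, W_beta, W_mask)
print(f_i.shape, float(M.min()), float(M.max()))
```

The mask lets each channel choose between the motion-modulated feature and the original static feature, which is one plausible reading of "combine ... with an attention mask".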
