Video Representation Learning with Visual Tempo Consistency

Ceyuan Yang, Yinghao Xu, Bo Dai, Bolei Zhou

TL;DR

This work introduces Visual Tempo Consistency as a self-supervised signal for video representation learning by contrasting same-action instances captured at slow and fast tempos. It proposes a hierarchical contrastive framework (VTHCL) that leverages temporal features from multiple network depths to strengthen supervision and employs memory banks to scale the training. An Instance Correspondence Map (ICM) provides a qualitative visualization of the shared semantics learned across tempos. Experiments on UCF-101, HMDB-51, AVA, and Epic-Kitchen show competitive action recognition and promising transfer to detection and anticipation tasks, highlighting the method's generalization and interpretability.

Abstract

Visual tempo, which describes how fast an action goes, has shown its potential in supervised action recognition. In this work, we demonstrate that visual tempo can also serve as a self-supervision signal for video representation learning. We propose to maximize the mutual information between representations of slow and fast videos via hierarchical contrastive learning (VTHCL). Specifically, by sampling the same instance at slow and fast frame rates, we obtain slow and fast video clips that share the same semantics but differ in visual tempo. Video representations learned with VTHCL achieve competitive performance under the self-supervision evaluation protocol for action recognition on UCF-101 (82.1%) and HMDB-51 (49.2%). Moreover, comprehensive experiments suggest that the learned representations generalize well to other downstream tasks, including action detection on AVA and action anticipation on Epic-Kitchen. Finally, we propose the Instance Correspondence Map (ICM) to visualize the shared semantics captured by contrastive learning.
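The core idea in the abstract, sampling one video at two frame rates and pulling the two views together with a contrastive objective, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names `sample_clip_indices` and `info_nce`, the strides and clip lengths, the temperature of 0.07, and the use of a plain InfoNCE loss on a single feature level (the paper applies its objective hierarchically and uses memory banks for negatives).

```python
import math

def sample_clip_indices(num_frames, clip_len, stride, start=0):
    """Return clip_len frame indices spaced `stride` frames apart,
    clamped to the last frame of the video."""
    return [min(start + i * stride, num_frames - 1) for i in range(clip_len)]

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(anchor, positive, negatives, temperature=0.07):
    """InfoNCE loss: negative log-softmax of the positive's cosine
    similarity among the positive and all negatives."""
    a = l2_normalize(anchor)
    candidates = [l2_normalize(positive)] + [l2_normalize(n) for n in negatives]
    logits = [sum(x * y for x, y in zip(a, c)) / temperature for c in candidates]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

# Two views of the same 64-frame video covering the same 32-frame span
# (hypothetical settings): a low frame rate and a high frame rate.
slow_rate_idx = sample_clip_indices(64, clip_len=4, stride=8)
fast_rate_idx = sample_clip_indices(64, clip_len=16, stride=2)

# Toy 2-D embeddings standing in for encoder outputs.
loss_aligned = info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
loss_misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

In this toy setting the loss is lower when the two tempo views of the same instance have similar embeddings than when the anchor is closer to another instance, which is the behaviour the contrastive objective rewards.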

Paper Structure

This paper contains 17 sections, 6 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Visual Tempo Consistency encourages the network to learn high representational similarity between views of the same instance (e.g. $V_i$) sampled at different tempos (e.g. $V_{i}^s$ and $V_{i}^f$). It also follows the same mechanism as the earlier instance discrimination task (Wu et al., 2018), which distinguishes individual instances according to their visual cues.
  • Figure 2: Framework. (a) The same instance at different tempos (e.g. $V_{i}^f$ and $V_{i}^s$) should share high similarity in terms of discriminative semantics while remaining dissimilar to other instances (grey dots). (b) Features at various network depths make it possible to construct hierarchical representation spaces.
  • Figure 3: Illustration of ICM. (a) The similarity measurement between a positive pair (blue and yellow). (b) Using one sample in the pair as the reference, ICM highlights the instance-specific shared regions. Note that the channel and temporal dimensions are omitted for brevity.
  • Figure 4: Examples of ICMs. Without any annotations, ICM suggests that encoders try to spatially and temporally localize the core objects (i.e. moving objects in dynamic scenes and salient objects in static scenes) when minimizing the contrastive loss.
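
The Figure 3 caption describes using one sample of a positive pair as a reference and highlighting the regions of the other sample that match it. A rough sketch of that idea follows. It assumes the reference is a single pooled feature vector and scores each spatial cell of the other sample's feature map by cosine similarity. The name `correspondence_map`, the toy feature values, and the choice of cosine similarity are illustrative assumptions, not the paper's exact ICM procedure. As in the caption, the temporal dimension is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def correspondence_map(reference, feature_map):
    """Score every cell of an H x W grid of C-dimensional features
    against a C-dimensional reference vector."""
    return [[cosine(reference, cell) for cell in row] for row in feature_map]

# Hypothetical 2 x 2 feature map with 2 channels per location.
reference = [1.0, 0.0]
feature_map = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [-1.0, 0.0]],
]
icm = correspondence_map(reference, feature_map)
```

Cells whose features point in the same direction as the reference receive scores near 1, and rendering the resulting grid as a heat map over the frame gives a visualization in the spirit of Figure 4.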