Video Representation Learning with Visual Tempo Consistency

Ceyuan Yang; Yinghao Xu; Bo Dai; Bolei Zhou

TL;DR

This work introduces visual tempo consistency as a self-supervised signal for video representation learning: the same video instance is sampled at slow and fast frame rates, and the two resulting views are contrasted against other instances. It proposes a hierarchical contrastive framework (VTHCL) that draws on temporal features from multiple network depths to strengthen supervision and uses memory banks to scale training. An Instance Correspondence Map (ICM) provides a qualitative visualization of the semantics shared across tempos. Experiments on UCF-101, HMDB-51, AVA, and Epic-Kitchen show competitive action recognition and promising transfer to action detection and anticipation.

Abstract

Visual tempo, which describes how fast an action goes, has shown its potential in supervised action recognition. In this work, we demonstrate that visual tempo can also serve as a self-supervision signal for video representation learning. We propose to maximize the mutual information between representations of slow and fast videos via hierarchical contrastive learning (VTHCL). Specifically, by sampling the same instance at slow and fast frame rates, we obtain slow and fast video frames that share the same semantics but have different visual tempos. Video representations learned with VTHCL achieve competitive performance under the self-supervised evaluation protocol for action recognition on UCF-101 (82.1%) and HMDB-51 (49.2%). Moreover, comprehensive experiments suggest that the learned representations generalize well to other downstream tasks, including action detection on AVA and action anticipation on Epic-Kitchen. Finally, we propose the Instance Correspondence Map (ICM) to visualize the shared semantics captured by contrastive learning.
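As a rough illustration of the idea in the abstract, the sketch below samples one video at two frame rates and scores paired slow/fast embeddings with an InfoNCE-style contrastive loss, a standard lower bound on mutual information. It is not the authors' implementation: the function names (`sample_tempo_views`, `info_nce`), the window length, the strides, and the temperature are all assumptions chosen for illustration, and the hierarchical supervision and memory banks of VTHCL are left out.

```python
import numpy as np


def sample_tempo_views(num_frames, window=32, slow_stride=8, fast_stride=2, start=0):
    """Return frame indices for a slow and a fast view of the same video window.

    Both views cover the same frames [start, start + window), so they share
    semantics; only the sampling rate (the visual tempo) differs.
    Window and stride values are illustrative assumptions.
    """
    end = min(start + window, num_frames)
    slow_idx = np.arange(start, end, slow_stride)  # low frame rate
    fast_idx = np.arange(start, end, fast_stride)  # high frame rate
    return slow_idx, fast_idx


def info_nce(slow_emb, fast_emb, temperature=0.07):
    """InfoNCE-style loss between paired slow/fast embeddings of shape (N, D).

    Row i of each array is assumed to come from the same video instance
    (the positive pair); all other rows in the batch act as negatives.
    """
    slow = slow_emb / np.linalg.norm(slow_emb, axis=1, keepdims=True)
    fast = fast_emb / np.linalg.norm(fast_emb, axis=1, keepdims=True)
    logits = slow @ fast.T / temperature                  # (N, N) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))             # positives lie on the diagonal


# Usage: sample two views of a 64-frame video, then compare matched and
# mismatched pairings of random stand-in embeddings.
slow_idx, fast_idx = sample_tempo_views(num_frames=64)
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 16))
loss_matched = info_nce(emb, emb)
loss_mismatched = info_nce(emb, emb[::-1])
```

With these stand-in embeddings, the loss is lower when each slow embedding is paired with its own fast counterpart than when the pairing is scrambled, which is the behavior a tempo-consistency objective rewards.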
