Fine-Grained Spatiotemporal Motion Alignment for Contrastive Video Representation Learning

Minghao Zhu, Xiao Lin, Ronghao Dang, Chengju Liu, Qijun Chen

TL;DR

This paper develops a dense contrastive learning framework in the spatiotemporal domain to generate pixel-level motion supervision and designs a motion decoder and a foreground sampling strategy to eliminate the weak alignments in terms of time and space.

Abstract

As the most essential property in a video, motion information is critical to a robust and generalized video representation. To inject motion dynamics, recent works have adopted frame difference as the source of motion information in video contrastive learning, considering the trade-off between quality and cost. However, existing works align motion features at the instance level, which suffers from spatial and temporal weak alignment across modalities. In this paper, we present a Fine-grained Motion Alignment (FIMA) framework, capable of introducing well-aligned and significant motion information. Specifically, we first develop a dense contrastive learning framework in the spatiotemporal domain to generate pixel-level motion supervision. Then, we design a motion decoder and a foreground sampling strategy to eliminate the weak alignments in terms of time and space. Moreover, a frame-level motion contrastive loss is presented to improve the temporal diversity of the motion features. Extensive experiments demonstrate that the representations learned by FIMA possess great motion-awareness capabilities and achieve state-of-the-art or competitive results on downstream tasks across UCF101, HMDB51, and Diving48 datasets. Code is available at https://github.com/ZMHH-H/FIMA.
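
As a rough illustration of the "frame difference" signal mentioned in the abstract, the sketch below subtracts consecutive frames of a clip. It is a minimal NumPy example, not the authors' implementation: the function name `frame_difference`, the `(T, H, W, C)` layout, and the random test clip are all assumptions made here for illustration.

```python
import numpy as np


def frame_difference(clip: np.ndarray) -> np.ndarray:
    """Return differences between consecutive frames.

    clip: array of shape (T, H, W, C) (layout assumed for this sketch).
    Output has shape (T - 1, H, W, C).
    """
    clip = clip.astype(np.float32)
    # Frame t+1 minus frame t, for every t in [0, T-2].
    return clip[1:] - clip[:-1]


rng = np.random.default_rng(0)

# A hypothetical 8-frame clip with random content.
clip = rng.random((8, 32, 32, 3), dtype=np.float32)
diff = frame_difference(clip)

# A clip that repeats one frame has no motion, so its difference is zero.
static_clip = np.repeat(clip[:1], 8, axis=0)
static_diff = frame_difference(static_clip)
```

A static clip yields an all-zero difference, which is why frame difference is a cheap proxy for motion: only pixels that change between frames produce a non-zero response.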

Paper Structure (32 sections, 8 equations, 11 figures, 11 tables)


Figures (11)

  • Figure 1: An illustration of weak alignment across modalities. For temporal weak alignment, since motion semantics vary over time, directly aligning features with different timestamps introduces minimal shared information. For spatial weak alignment, the pooled feature is distracted by cluttered background noise, leading to misalignment of background features and inadequate alignment of foreground features.
  • Figure 2: Overview of the framework. We sample two temporally distant clips $\{V, \Tilde{V}\}$ and compute the frame difference $\Tilde{D}$. The corresponding dense feature maps $\{F^q, \Tilde{F}^q, \Tilde{M}^k\}$ are extracted by the encoder or its momentum version. We sample the foreground features at the $i$-th frame of $F^q$, concatenate them with a class token, and feed them into the motion decoder. The motion decoder reconstructs the foreground features of $\Tilde{M}^k$ in the $i$-th frame by collecting information from $\Tilde{F}_i^q$. Finally, the class token is used to reconstruct the local motion feature whose time interval overlaps exactly with that of $\Tilde{F}_i^q$.
  • Figure 3: Class-agnostic activation map visualization for MoCo baseline (middle column) and MoCo+$\mathcal{L}_{\mathrm{Pix}}$ (right column). $\mathcal{L}_{\mathrm{Pix}}$ is effective for alleviating background bias.
  • Figure 4: Class-agnostic activation map visualization for MoCo+$\mathcal{L}_{\mathrm{VD}}$ (middle column) and MoCo+$\mathcal{L}_{\mathrm{Pix}}$ (right column). Pre-training with $\mathcal{L}_{\mathrm{Pix}}$ provides richer motion information.
  • Figure 5: (a) Spatial affinity matrices and (b) Temporal similarity statistics between RGB features and motion features with MoCo+$\mathcal{L}_{\mathrm{VD}}$ pre-training and FIMA pre-training.
  • ...and 6 more figures
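
The captions above refer to foreground sampling and a pixel-level loss $\mathcal{L}_{\mathrm{Pix}}$ without defining them in this excerpt. The sketch below shows one generic way such a pipeline could look: pick the spatial positions with the largest motion magnitude, then apply a standard InfoNCE loss between RGB and motion features at those positions. Everything here is an assumption for illustration (the top-k selection rule, the helper names `sample_foreground` and `pixel_info_nce`, the temperature, and the random features); it is not the paper's $\mathcal{L}_{\mathrm{Pix}}$ or motion decoder, for which the released code is the reference.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def sample_foreground(motion_magnitude: np.ndarray, k: int) -> np.ndarray:
    """Return flat indices of the k positions with the largest magnitude.

    Assumed selection rule for this sketch (top-k by motion magnitude).
    """
    flat = motion_magnitude.reshape(-1)
    return np.argsort(flat)[-k:]


def pixel_info_nce(rgb: np.ndarray, motion: np.ndarray,
                   temperature: float = 0.1) -> float:
    """Generic InfoNCE over N positions; row i of each input is a positive pair.

    rgb, motion: arrays of shape (N, C).
    """
    q = l2_normalize(rgb)
    k = l2_normalize(motion)
    logits = q @ k.T / temperature                       # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


rng = np.random.default_rng(0)
H, W, C, K = 7, 7, 16, 8

# Hypothetical dense motion features and a motion-magnitude map.
motion_map = rng.standard_normal((H, W, C))
magnitude = rng.random((H, W))

idx = sample_foreground(magnitude, K)
motion_fg = motion_map.reshape(-1, C)[idx]

# RGB features that match the motion features position by position...
rgb_aligned = motion_fg + 0.01 * rng.standard_normal((K, C))
# ...versus the same features shifted by one position (spatially misaligned).
rgb_shifted = np.roll(rgb_aligned, 1, axis=0)

loss_aligned = pixel_info_nce(rgb_aligned, motion_fg)
loss_shifted = pixel_info_nce(rgb_shifted, motion_fg)
```

In this toy setup the aligned pairing gives a lower loss than the shifted one, which mirrors the intuition in Figure 1: a contrastive objective can only pull RGB and motion features together when the paired positions actually correspond.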