SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning

Fida Mohammad Thoker, Letian Jiang, Chen Zhao, Bernard Ghanem

TL;DR

SMILE addresses the limitations of pixel-focused masked video modeling by integrating high-level spatial semantics from CLIP and injecting synthetic object motion to emphasize temporal dynamics. It replaces pixel reconstruction with CLIP-feature reconstruction in a teacher-student framework and uses two masking schemes (tube and trajectory-based) on original and added-object tokens, respectively. Through motion augmentation and CLIP supervision, SMILE achieves state-of-the-art results on multiple action-recognition benchmarks and demonstrates robust generalization, including learning representations without natural videos. This approach establishes a new self-supervised paradigm for robust, motion-aware video representation learning with strong practical impact for diverse video understanding tasks.
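As a rough illustration of swapping pixel reconstruction for feature reconstruction, the sketch below computes a regression loss between student predictions and teacher features on masked tokens only. The exact loss is not stated in this summary; mean squared error over L2-normalized features is an assumption, and the names `masked_feature_loss`, `pred`, `target`, and `mask` are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each feature vector (last axis) to approximately unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def masked_feature_loss(pred, target, mask):
    """Squared error between normalized features, averaged over masked tokens.

    pred:   (N, D) student predictions, one row per space-time token
    target: (N, D) teacher (e.g. CLIP) features for the same tokens
    mask:   (N,) boolean, True where the token was hidden from the student
    Assumption: the normalized-MSE form is illustrative, not the paper's
    confirmed objective.
    """
    per_token = ((l2_normalize(pred) - l2_normalize(target)) ** 2).sum(axis=-1)
    return per_token[mask].mean()

# Usage with random stand-in features (no real CLIP model involved).
rng = np.random.default_rng(0)
teacher = rng.normal(size=(6, 4))
student = rng.normal(size=(6, 4))
hidden = np.array([True, False, True, False, False, True])
loss = masked_feature_loss(student, teacher, hidden)
```

Only the masked rows contribute, so errors on visible tokens do not affect the value; this mirrors the usual masked-modeling setup, where the student is scored on what it could not see.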

Abstract

Masked video modeling, as in VideoMAE, is an effective paradigm for video self-supervised learning (SSL). However, such methods primarily reconstruct pixel-level details of natural videos, which have substantial temporal redundancy; this limits their capacity for semantic representation and for adequately encoding motion dynamics. To address these issues, this paper introduces a novel SSL approach for video representation learning, dubbed SMILE, which infuses both spatial and motion semantics. In SMILE, we leverage image-language pretrained models, such as CLIP, to guide the learning process with their high-level spatial semantics. We enhance the representation of motion by introducing synthetic motion patterns into the training data, allowing the model to capture more complex and dynamic content. Furthermore, using SMILE, we establish a new self-supervised video learning paradigm capable of learning strong video representations without requiring any natural video data. We have carried out extensive experiments on seven datasets with various downstream scenarios. SMILE surpasses current state-of-the-art SSL methods, showcasing its effectiveness in learning more discriminative and generalizable video representations. Code is available at https://github.com/fmthoker/SMILE

Paper Structure

This paper contains 39 sections, 2 equations, 6 figures, 18 tables.

Figures (6)

  • Figure 1: Comparison with SOTA masked video modeling methods. Our method significantly outperforms prior masked video modeling methods across diverse downstream settings.
  • Figure 2: Overall architecture of SMILE. An input video clip $V$ is overlaid with a segmented object along a randomly generated trajectory to produce $V^{'}$, infusing synthetic object motion into $V$. $V^{'}$ is passed frame by frame through the CLIP encoder to extract feature tokens $\mathcal{F}$, and is also patchified into a set of space-time tokens $\mathcal{T}$. We apply two types of masking to $\mathcal{T}$: tube masking on the space-time tokens of the original video, and trajectory-based masking on the tokens of the added objects. The unmasked tokens $\mathcal{T}_{unmask}$ are fed into the encoder-decoder network $\Phi_{enc}$-$\Phi_{dec}$, which is trained to reconstruct the masked feature tokens in $\mathcal{F}$.
  • Figure 3: Motion-overlaid videos showing the position change and transformation of the added objects along the time dimension.
  • Figure 4: Feature similarity across different frames for different SSL methods. We compute this on K400 validation videos.
  • Figure 5: Performance comparison with a fixed training budget. We evaluate on SSv2 and GYM for full finetuning. Our method consistently outperforms VideoMAE (Tong et al., 2022) across all data scales with the same training budget.
  • ...and 1 more figure
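The data path in the Figure 2 caption (overlay an object along a random trajectory, then mask original tokens with tubes and object tokens along the trajectory) can be sketched in NumPy as follows. This is a minimal sketch under several assumptions that the summary does not confirm: the trajectory is a straight line between two random positions, every token touched by the object counts as masked, and the function names, tubelet size, and patch size are all hypothetical.

```python
import numpy as np

def random_trajectory(num_frames, frame_hw, obj_hw, rng):
    """Top-left corner per frame on a straight line (assumed motion model)."""
    H, W = frame_hw
    h, w = obj_hw
    start = rng.uniform([0, 0], [H - h, W - w])
    end = rng.uniform([0, 0], [H - h, W - w])
    alphas = np.linspace(0.0, 1.0, num_frames)[:, None]
    return np.rint((1 - alphas) * start + alphas * end).astype(int)  # (T, 2)

def overlay_object(video, obj, obj_mask, traj):
    """Paste obj onto each frame at traj[t]; return V' and a pixel-level object mask.

    video: (T, H, W, C), obj: (h, w, C), obj_mask: (h, w) boolean segmentation.
    """
    out = video.copy()
    pix = np.zeros(video.shape[:3], dtype=bool)
    h, w = obj_mask.shape
    for t, (y, x) in enumerate(traj):
        region = out[t, y:y + h, x:x + w]   # view into out, edited in place
        region[obj_mask] = obj[obj_mask]
        pix[t, y:y + h, x:x + w] = obj_mask
    return out, pix

def token_object_mask(pix, tubelet=2, patch=16):
    """Mark a space-time token as an object token if any of its pixels is object."""
    T, H, W = pix.shape
    cubes = pix.reshape(T // tubelet, tubelet, H // patch, patch, W // patch, patch)
    return cubes.any(axis=(1, 3, 5))        # (T/tubelet, H/patch, W/patch)

def tube_mask(grid_thw, ratio, rng):
    """Mask the same random spatial positions at every temporal index."""
    Tt, Hh, Ww = grid_thw
    n = Hh * Ww
    flat = np.zeros(n, dtype=bool)
    flat[rng.choice(n, size=int(round(ratio * n)), replace=False)] = True
    return np.broadcast_to(flat.reshape(1, Hh, Ww), (Tt, Hh, Ww)).copy()

# Usage on a blank stand-in clip with a square stand-in object.
rng = np.random.default_rng(0)
video = np.zeros((4, 32, 32, 3), dtype=np.float32)
obj = np.ones((8, 8, 3), dtype=np.float32)
obj_mask = np.ones((8, 8), dtype=bool)
traj = random_trajectory(4, (32, 32), (8, 8), rng)
v_prime, pix = overlay_object(video, obj, obj_mask, traj)
obj_tokens = token_object_mask(pix, tubelet=2, patch=8)
tubes = tube_mask(obj_tokens.shape, ratio=0.75, rng=rng)
# Assumption: object tokens are always masked; the rest follow the tube mask.
final_mask = tubes | obj_tokens
```

Because the tube mask repeats across time while the object mask follows the moving object, the combined mask hides static background consistently but forces the model to track where the inserted object goes.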