Text-Guided Video Masked Autoencoder

David Fan, Jue Wang, Shuai Liao, Zhikang Zhang, Vimal Bhat, Xinyu Li

TL;DR

Text-Guided Video Masked Autoencoder introduces a text-driven masking strategy that uses caption-video alignment to identify salient video regions for MAE pretraining, removing reliance on explicit visual priors like motion. It further proposes a unified framework that combines masked video MAE with an optional masked video-text contrastive loss, improving downstream performance across diverse action-recognition datasets. The method leverages off-the-shelf captioning (BLIP) and CLIP-based text-video similarities to guide masking, with optional contrastive learning aligning masked encoder outputs with text representations. Ablations and transfers demonstrate that language-guided masking generalizes to smaller and egocentric datasets, and that masking-guided semantic learning and contrastive signals jointly yield robust gains. Overall, the work establishes language-guided masked video modeling as a practical and scalable direction for improving video representations.

Abstract

Recent video masked autoencoder (MAE) works have designed improved masking algorithms focused on saliency. These works leverage visual cues such as motion to mask the most salient regions. However, the robustness of such visual cues depends on how often input videos match underlying assumptions. On the other hand, natural language description is an information-dense representation of video that implicitly captures saliency without requiring modality-specific assumptions, and has not yet been explored for video MAE. To this end, we introduce a novel text-guided masking algorithm (TGM) that masks the video regions with highest correspondence to paired captions. Without leveraging any explicit visual cues for saliency, our TGM is competitive with state-of-the-art masking algorithms such as motion-guided masking. To further benefit from the semantics of natural language for masked reconstruction, we next introduce a unified framework for joint MAE and masked video-text contrastive learning. We show that across existing masking algorithms, unifying MAE and masked video-text contrastive learning improves downstream performance compared to pure MAE on a variety of video recognition tasks, especially for linear probe. Within this unified framework, our TGM achieves the best relative performance on five action recognition datasets and one egocentric dataset, highlighting the complementary nature of natural language for masked video modeling.
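
The masking rule described in the abstract (mask the video regions with the highest correspondence to the paired caption) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine` and `text_guided_mask`, the use of cosine similarity as the correspondence score, and the toy 2-D embeddings are all assumptions made for illustration. In practice the patch and caption embeddings would come from a frozen vision-language model rather than hand-written lists.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def text_guided_mask(patch_embs, text_emb, mask_ratio):
    """Return one boolean per patch; True marks a patch selected for masking.

    Assumption: correspondence is scored by cosine similarity, and the
    top `mask_ratio` fraction of patches by score is masked.
    """
    scores = [cosine(p, text_emb) for p in patch_embs]
    num_masked = int(round(mask_ratio * len(patch_embs)))
    # Rank patch indices from most to least similar to the caption.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    masked = set(ranked[:num_masked])
    return [i in masked for i in range(len(patch_embs))]


# Toy example: four patch embeddings and one caption embedding (hypothetical values).
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
caption = [1.0, 0.0]
mask = text_guided_mask(patches, caption, mask_ratio=0.5)
print(mask)
```

With these toy values the two patches most aligned with the caption are masked, so the encoder would only see the two least caption-relevant patches. The mask ratio is left as a required argument because this section does not state the value used in the paper.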

Paper Structure (28 sections, 4 equations, 5 figures, 9 tables)

Figures (5)

  • Figure 1: Illustration of different masking strategies. Random masking (Feichtenhofer et al., 2022; Tong et al., 2022) masks patches at random, independently of their contents. Motion-guided masking (Fan et al., 2023; Huang et al., 2023) tracks the motion of patches over time to mask a moving volume. Our proposed text-guided masking masks the patches with the highest video patch-to-text correspondence.
  • Figure 2: For each video, we generate a caption using an off-the-shelf image captioning model such as BLIP (Li et al., 2022). We then leverage the aligned representation space of CLIP (Radford et al., 2021) to mask the patches with the highest correspondence to the text. The MAE pipeline is identical to VideoMAE (Tong et al., 2022), where the encoder processes the visible patches and the decoder processes the union of encoded visible patches and mask tokens. We additionally introduce an optional contrastive loss to align the encoded visible patches with the text. This facilitates semantic-aware reconstruction. BLIP and CLIP receive no gradients.
  • Figure 3: Visualizations from three perspectives: the visualized mask (row 2), reconstructed RGB output (row 3), and encoder attention map (row 4). Our TGM learns the reconstruction task reasonably well and attends to the salient video regions.
  • Figure 4: Contrastive loss for each masking algorithm, with and without optimization.
  • Figure 5: Additional visualizations from three perspectives: the visualized mask, reconstructed RGB output, and encoder attention map. We see that our TGM solves the reconstruction task reasonably well and learns to attend to the salient regions of the video.
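
The Figure 2 caption describes a reconstruction objective with an optional contrastive loss that aligns the encoded visible patches with the text. A minimal sketch of such a joint objective follows. Several details are assumptions rather than facts from the source: the reconstruction term is taken to be mean squared error over masked patches, the contrastive term is taken to be a symmetric InfoNCE loss over a batch of pooled video and text embeddings, and the names `mse_on_masked`, `info_nce`, `joint_loss`, `weight`, and `temperature` are hypothetical.

```python
import math


def mse_on_masked(pred, target, mask):
    """Mean squared error averaged over masked patches only."""
    errs = [
        sum((p - t) ** 2 for p, t in zip(pp, tp)) / len(pp)
        for pp, tp, m in zip(pred, target, mask)
        if m
    ]
    if not errs:
        raise ValueError("mask selects no patches")
    return sum(errs) / len(errs)


def info_nce(video_embs, text_embs, temperature):
    """Symmetric InfoNCE; row i of each list is assumed to be a matched pair."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    v = [normalize(x) for x in video_embs]
    t = [normalize(x) for x in text_embs]
    # logits[i][j]: scaled cosine similarity between video i and text j.
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]

    def cross_entropy_diag(rows):
        # Cross-entropy with the diagonal entry as the correct class.
        total = 0.0
        for i, row in enumerate(rows):
            mx = max(row)
            log_sum_exp = mx + math.log(sum(math.exp(x - mx) for x in row))
            total += log_sum_exp - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


def joint_loss(pred, target, mask, video_embs, text_embs, weight, temperature):
    """Reconstruction loss plus a weighted contrastive loss (weight=0 gives pure MAE)."""
    recon = mse_on_masked(pred, target, mask)
    return recon + weight * info_nce(video_embs, text_embs, temperature)


# Toy batch with hypothetical values: two videos, two captions, 2-D embeddings.
video = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]
aligned = info_nce(video, text_aligned, temperature=0.1)
swapped = info_nce(video, text_swapped, temperature=0.1)
print(aligned < swapped)
```

Setting `weight` to zero recovers a pure reconstruction objective, which mirrors the caption's description of the contrastive loss as optional. In a real training setup the captioning and similarity models would be frozen, consistent with the note that BLIP and CLIP receive no gradients.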