CrossVideoMAE: Self-Supervised Image-Video Representation Learning with Masked Autoencoders

Shihab Aaqil Ahamed, Malitha Gunawardhana, Liel David, Michael Sidorov, Daniel Harari, Muhammad Haris Khan

TL;DR

CrossVideoMAE tackles the limitation of existing video MAEs by introducing a cross-modal SSL framework that jointly learns video-level and frame-level spatiotemporal representations and semantic attributes through intra- and cross-modal contrastive objectives. It leverages a video branch (SpatioTemporalMAE) and an image branch (pre-trained MAE) to distill semantic cues from sampled frames into videos, enforcing invariance to video augmentations and cross-modal correspondences. The method combines intra-modal NT-Xent losses, cross-modal alignment, and reconstruction/MSE losses into a unified objective, showing strong improvements on SSv2 and competitive results on other datasets, with ablations supporting key design choices such as masking ratios, frame sampling, and joint objectives. This approach demonstrates efficient, label-free learning of rich video representations with practical transferability to downstream action recognition and retrieval tasks.
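The NT-Xent (normalized temperature-scaled cross-entropy) loss named above can be illustrated with a small standard-library sketch. This is the generic formulation over two batches of paired embeddings, not the paper's implementation: the function names `cosine` and `nt_xent`, the temperature of 0.1, and the toy 2-D embeddings are all assumptions made for illustration, and the inputs are assumed to be non-zero vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(view_a, view_b, temperature=0.1):
    """Generic NT-Xent loss; view_a[i] and view_b[i] form a positive pair.

    The temperature default is an illustrative assumption, not a value
    taken from the paper.
    """
    n = len(view_a)
    z = list(view_a) + list(view_b)  # 2n embeddings in total
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the positive partner of anchor i
        # Similarities to every other embedding (the anchor itself is excluded).
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        # Numerically stable log-sum-exp for the denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - pos_logit
    return total / (2 * n)


# Toy check: correctly paired views should score lower than shuffled pairs.
aligned = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In an intra-modal setting the two inputs would be embeddings of two augmented views of the same videos; in a cross-modal setting they would be video embeddings and the embeddings of their sampled frames. How the paper weights and combines these terms is not specified in this summary.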

Abstract

Current video-based Masked Autoencoders (MAEs) primarily focus on learning effective spatiotemporal representations from a visual perspective. This may lead the model to prioritize general spatiotemporal patterns while overlooking nuanced semantic attributes, such as the specific interactions or sequences that define actions: action-specific features that align more closely with human cognition for space-time correspondence. This can limit the model's ability to capture the essence of certain actions that are contextually rich and continuous. Humans are capable of mapping visual concepts, object view invariance, and semantic attributes available in static instances to comprehend natural dynamic scenes or videos. Existing MAEs for videos and static images rely on separate datasets for videos and images, which may lack the rich semantic attributes necessary for fully understanding the learned concepts, especially when compared to using a video together with its corresponding sampled frame images. To this end, we propose CrossVideoMAE, an end-to-end self-supervised cross-modal contrastive learning MAE that effectively learns both video-level and frame-level rich spatiotemporal representations and semantic attributes. Our method integrates mutual spatiotemporal information from videos with spatial information from sampled frames within a feature-invariant space, while encouraging invariance to augmentations within the video domain. This objective is achieved by jointly embedding features of visible tokens and combining feature correspondence within and across modalities, which is critical for acquiring rich, label-free guiding signals from both video and frame image modalities in a self-supervised manner. Extensive experiments demonstrate that our approach surpasses previous state-of-the-art methods, and ablation studies validate the effectiveness of our approach.
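The "visible tokens" mentioned in the abstract are the tokens that survive masking, and the Figure 1 caption cites masking ratios of 90% and 95%. The sketch below shows how few tokens remain visible at those ratios. It assumes uniform random masking over token indices and a hypothetical 8 x 14 x 14 token grid (for example, 16 frames at 224 x 224 with 2 x 16 x 16 patches); neither the grid size nor the helper name `random_token_mask` comes from the paper, whose actual masking strategy may differ.

```python
import random


def random_token_mask(num_tokens, mask_ratio, seed=None):
    """Split token indices into (visible, masked) by uniform random masking."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    num_masked = int(round(num_tokens * mask_ratio))
    order = list(range(num_tokens))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])  # only these would reach the encoder
    return visible, masked


# Hypothetical token grid: 8 temporal x 14 x 14 spatial positions (an assumption).
NUM_TOKENS = 8 * 14 * 14

visible_90, masked_90 = random_token_mask(NUM_TOKENS, 0.90, seed=0)
visible_95, masked_95 = random_token_mask(NUM_TOKENS, 0.95, seed=0)

print(len(visible_90), len(visible_95))
```

Under these assumptions the encoder processes only a small fraction of the tokens, which is what makes high masking ratios computationally attractive during pre-training.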

Paper Structure

This paper contains 42 sections, 11 equations, 19 figures, 21 tables.

Figures (19)

  • Figure 1: Self-attention map visualization of the proposed approach. This demonstrates the efficacy of our method in learning spatiotemporal and semantic representations. The rows depict: original video frames from an action video sequence (first row), masked frames with random masking applied (second row), reconstructed frames (third row), self-attention heatmaps highlighting spatiotemporal representations (fourth row), self-attention heatmaps overlaid on reconstructed frames (fifth row), and semantic self-attention maps visualizing semantic attributes (sixth row). Our approach aims to capture the spatiotemporal-spatial feature embedding correspondence of visible tokens across sampled frames and videos, utilizing differences between masking ratios (90% and 95%), to relate high-level visual and semantic tokens that encode intricate relationships. This joint intra-modal and cross-modal feature embedding at both the video and frame levels enhances invariance to augmentations in the video domain and facilitates effective semantic knowledge distillation from sampled frames to videos. (See the supplementary material for more visualizations.)
  • Figure 1: An example self-attention map visualization of our CrossVideoMAE on the K400 dataset.
  • Figure 2: (A) The proposed CrossVideoMAE framework comprises two branches: the video branch and the image branch. The video branch employs intra-modal pre-training to ensure that the encoder develops invariance to augmentations within the video domain. The image branch leverages cross-modal pre-training to distill semantic knowledge from a pre-trained MAE (He et al., 2022), transferring insights from sampled frames to the corresponding videos. The model is pre-trained jointly across the video and image domains using a combination of intra-modal and cross-modal contrastive learning objectives at both the video and frame levels. For downstream tasks, the image branch is discarded, and only the video branch encoder is used as the backbone. (B) Zoomed-in view of the feature space. It shows the spatiotemporal-spatial alignment of feature embedding correspondence for visible tokens, which ensures invariance at both the video level and the frame level and makes the representation more robust.
  • Figure 2: An example self-attention map visualization of our CrossVideoMAE on the K400 dataset for a masking ratio of 95%.
  • Figure 3: An example self-attention map visualization of our CrossVideoMAE on the K400 dataset for a masking ratio of 95%.
  • ...and 14 more figures