OmniMAE: Single Model Masked Pretraining on Images and Videos

Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

TL;DR

OmniMAE investigates unifying image and video representation learning under a single Vision Transformer via masked autoencoding. By jointly pretraining on images and videos with extreme masking ratios, the approach learns transferable, label-free representations and achieves strong finetuning performance on both domains, notably 86.6% on ImageNet and 75.5% on SSv2 with ViT-H. The method relies on a simple encoder-decoder MAE objective with a shared backbone and benefits from techniques like sample replication to boost training efficiency. This work demonstrates that a generic, scalable multi-modal pretraining paradigm can rival modality-specific designs and paves the way for broader cross-domain representation learning.
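
Sample replication, as mentioned above, repeats each loaded sample several times within a mini-batch while the overall mini-batch size stays fixed. The sketch below illustrates that idea only; the function name `replicated_batch`, its parameters, and the choice to draw an independent random mask per replica are assumptions made for illustration, not the authors' implementation.

```python
import random

def replicated_batch(dataset_indices, batch_size, n_repeats,
                     num_patches, mask_ratio, rng):
    """Build one mini-batch in which each loaded sample appears n_repeats times.

    Returns a list of (sample_index, visible_patch_indices) pairs.
    Assumption: every replica draws its own random set of visible patches.
    """
    assert batch_size % n_repeats == 0, "batch size must be divisible by n_repeats"
    num_unique = batch_size // n_repeats            # samples that must be loaded
    unique = rng.sample(dataset_indices, num_unique)
    num_keep = max(1, round(num_patches * (1 - mask_ratio)))

    batch = []
    for idx in unique:
        for _ in range(n_repeats):
            # A fresh random subset of patches stays visible for each replica.
            visible = sorted(rng.sample(range(num_patches), num_keep))
            batch.append((idx, visible))
    return batch

rng = random.Random(0)
batch = replicated_batch(list(range(1000)), batch_size=8, n_repeats=4,
                         num_patches=1568, mask_ratio=0.95, rng=rng)
print(len(batch), "entries from", len({i for i, _ in batch}), "loaded samples")
```

With a fixed batch size, a larger `n_repeats` means fewer distinct samples have to be read and decoded per step, which is one plausible reason replication helps most on video.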

Abstract

Transformer-based architectures have become competitive across a variety of visual domains, most notably images and videos. While prior work studies these modalities in isolation, having a common architecture suggests that one can train a single unified model for multiple visual modalities. Prior attempts at unified modeling typically use architectures tailored for vision tasks, or obtain worse performance compared to single modality models. In this work, we show that masked autoencoding can be used to train a simple Vision Transformer on images and videos, without requiring any labeled data. This single model learns visual representations that are comparable to or better than single-modality representations on both image and video benchmarks, while using a much simpler architecture. Furthermore, this model can be learned by dropping 90% of the image and 95% of the video patches, enabling extremely fast training of huge model architectures. In particular, we show that our single ViT-Huge model can be finetuned to achieve 86.6% on ImageNet and 75.5% on the challenging Something Something-v2 video benchmark, setting a new state-of-the-art.
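
To make the masking ratios in the abstract concrete, the following NumPy sketch splits an input into spatio-temporal patches and keeps a random subset for the encoder. The input sizes (16 frames at 224×224), the patch size (2×16×16), and the treatment of an image as a 2-frame clip obtained by replication are assumptions chosen for illustration; `patchify` and `random_mask` are hypothetical helper names.

```python
import numpy as np

def patchify(video, pt=2, ps=16):
    """Split a (T, H, W, C) array into flattened spatio-temporal patches.

    Returns an array of shape (num_patches, pt * ps * ps * C).
    """
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ps == 0 and W % ps == 0
    x = video.reshape(T // pt, pt, H // ps, ps, W // ps, ps, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)   # (t, h, w, pt, ps, ps, C)
    return x.reshape(-1, pt * ps * ps * C)

def random_mask(num_patches, mask_ratio, rng):
    """Return (visible_indices, masked_indices) for uniform random masking."""
    num_keep = max(1, int(round(num_patches * (1 - mask_ratio))))
    perm = rng.permutation(num_patches)
    return np.sort(perm[:num_keep]), np.sort(perm[num_keep:])

rng = np.random.default_rng(0)

# Assumed video input: 16 frames, 224x224 RGB, masked at 95%.
video = rng.random((16, 224, 224, 3), dtype=np.float32)
video_patches = patchify(video)
vis_v, _ = random_mask(len(video_patches), 0.95, rng)

# Assumed image input: a single frame replicated to 2 frames, masked at 90%.
image = rng.random((224, 224, 3), dtype=np.float32)
image_patches = patchify(np.stack([image, image]))
vis_i, _ = random_mask(len(image_patches), 0.90, rng)

# Only the visible patches would be fed to the encoder.
print(len(vis_v), "of", len(video_patches), "video patches visible")
print(len(vis_i), "of", len(image_patches), "image patches visible")
```

Under these assumed sizes the encoder processes 78 of 1568 video patches and 20 of 196 image patches, which illustrates why such high masking ratios make training large models fast.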

Paper Structure (21 sections, 7 figures, 10 tables)

Figures (7)

  • Figure 1: OmniMAE is a single model for images and videos that is trained using masked autoencoding (He et al., 2021). We use a plain Vision Transformer (Dosovitskiy et al., 2020) architecture but with spatio-temporal patches as input. During training, we 'patchify' the visual input (images or videos) and feed the encoder only a subset of the patches. The decoder reconstructs the pixels for the missing patches using the encoder's output. The encoder-decoder model is trained using a pixel reconstruction loss. After training, our single plain Transformer encoder performs competitively compared to specialized architectures on downstream image and video recognition tasks.
  • Figure 2: OmniMAE on image and video downstream tasks. We finetune the MAE, ST-MAE, and OmniMAE models on image and video benchmarks. We use the ViT architecture with two model sizes: ViT-B and ViT-L. MAE has poor video recognition performance, while ST-MAE's performance drops on image datasets. OmniMAE pretraining generalizes to both benchmarks. All models are trained for 800 epochs on the pretraining datasets. The image-only MAE model is inflated (Carreira & Zisserman, 2017) to apply MAE to video recognition tasks. The input image is replicated to apply ST-MAE to image recognition benchmarks.
  • Figure 3: Different types of masking for images (left two) and videos. Causal and tube masking use the data's spatio-temporal structure. Random frame masking randomly masks frames in a video. Random masking randomly masks patches and is used by default for OmniMAE.
  • Figure 4: Reconstruction visualizations using OmniMAE on different video and image datasets. We show the model predictions for varying masking ratios of the input from 75% to 95% and the ground truth reference (Ref). OmniMAE is trained on ImageNet and SSv2 but the predictions generalize to other datasets like K400 and EK100. Please see the supplement for video visualizations.
  • Figure 5: Sample replication. We study the effect of repeating samples while training our model. In each case, we repeat a sample $n$ times within a mini-batch while fixing the overall mini-batch size and the number of training updates. Replication leads to improved training speeds, especially on video, without affecting the final performance.
  • ...and 2 more figures
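
The masking types named in the Figure 3 caption can be sketched on a grid of patch tokens of shape (t, h, w). This is an illustrative interpretation under stated assumptions, not the authors' code: tube masking is taken to reuse one spatial mask for every time step, causal masking is taken to keep the earliest tokens in raster order and mask the rest, and all function names are hypothetical. In each mask, `True` marks a masked patch.

```python
import numpy as np

def random_masking(t, h, w, ratio, rng):
    """Mask a uniformly random subset of all patches."""
    n = t * h * w
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=int(round(n * ratio)), replace=False)] = True
    return mask.reshape(t, h, w)

def tube_masking(t, h, w, ratio, rng):
    """Mask the same spatial locations at every time step (assumed definition)."""
    spatial = np.zeros(h * w, dtype=bool)
    spatial[rng.choice(h * w, size=int(round(h * w * ratio)), replace=False)] = True
    return np.broadcast_to(spatial.reshape(1, h, w), (t, h, w)).copy()

def frame_masking(t, h, w, ratio, rng):
    """Mask whole time steps chosen at random."""
    frames = np.zeros(t, dtype=bool)
    frames[rng.choice(t, size=int(round(t * ratio)), replace=False)] = True
    return np.broadcast_to(frames.reshape(t, 1, 1), (t, h, w)).copy()

def causal_masking(t, h, w, ratio):
    """Keep the earliest tokens in raster order, mask the rest (assumed definition)."""
    n = t * h * w
    num_keep = n - int(round(n * ratio))
    mask = np.ones(n, dtype=bool)
    mask[:num_keep] = False
    return mask.reshape(t, h, w)

rng = np.random.default_rng(0)
t, h, w, ratio = 8, 14, 14, 0.75   # assumed token grid and masking ratio
masks = {
    "random": random_masking(t, h, w, ratio, rng),
    "tube": tube_masking(t, h, w, ratio, rng),
    "frame": frame_masking(t, h, w, ratio, rng),
    "causal": causal_masking(t, h, w, ratio),
}
for name, m in masks.items():
    print(name, "masked fraction:", m.mean())
```

All four variants mask the same fraction of tokens here; they differ only in how much spatio-temporal structure the visible tokens retain.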