ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders

Jefferson Hernandez, Ruben Villegas, Vicente Ordonez

TL;DR

Visual representations learned under ViC-MAE generalize well to both video and image classification tasks, and the model maintains balanced transfer-learning performance between video and image classification benchmarks.

Abstract

We propose ViC-MAE, a model that combines Masked AutoEncoders (MAE) and contrastive learning. ViC-MAE is trained using a global feature obtained by pooling the local representations learned under an MAE reconstruction loss, and this representation is then used under a contrastive objective across images and video frames. We show that visual representations learned under ViC-MAE generalize well to both video and image classification tasks. In particular, ViC-MAE obtains state-of-the-art transfer learning performance from video to images on ImageNet-1k compared to the recently proposed OmniMAE, achieving a top-1 accuracy of 86% (+1.3% absolute improvement) when trained on the same data and 87.1% (+2.4% absolute improvement) when trained on extra data. At the same time, ViC-MAE outperforms most other methods on video benchmarks, obtaining 75.9% top-1 accuracy on the challenging Something-Something-v2 video benchmark. When trained on videos and images from a diverse combination of datasets, our method maintains balanced transfer-learning performance between video and image classification benchmarks, coming in as a close second to the best supervised method.

Paper Structure (26 sections, 9 equations, 4 figures, 9 tables, 1 algorithm)

Figures (4)

  • Figure 1: ViC-MAE operates over video frames and images using masked image modeling at the image and frame level and contrastive learning at the temporal level for videos and under image transformations for images. Our model represents a strong backbone for both image and video tasks.
  • Figure 2: ViC-MAE inputs two distant frames from a video or two different views of an image within the same batch using a siamese backbone (shared weights), and randomly masks them, before passing them through a ViT model which learns a representation of local features using masked image modeling. A global representation of the video is then constructed by global pooling of the local features learned by the ViT model trained to reconstruct individual patches using an $\ell_2$ loss. A standard predictor and a target encoder are used with a contrastive loss. Our use of an aggregation layer before the predictor network aids in avoiding the collapse of the learned global representations.
  • Figure 3: Additional comparisons with the state-of-the-art and recently proposed methods.
  • Figure: ViC-MAE PyTorch pseudocode.
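
The pipeline described in the Figure 2 caption (random masking, an $\ell_2$ reconstruction loss on patches, global pooling of local features, and a contrastive loss between two frames or views) can be illustrated with a small NumPy sketch. This is not the authors' pseudocode. Everything specific here is an assumption made for illustration: the mask ratio of 0.75, mean pooling as the pooling operator, InfoNCE as the contrastive loss, the temperature, the loss weight `LAMBDA`, and the random arrays standing in for ViT features and decoder outputs. The predictor, target encoder, and aggregation layer mentioned in the caption are omitted.

```python
import numpy as np

def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean vector where True marks a masked (hidden) patch."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask

def masked_l2_loss(pred, target, mask):
    """Mean squared error over masked patches only.

    pred, target: (B, N, D) patch arrays; mask: (B, N) boolean.
    """
    per_patch = ((pred - target) ** 2).mean(axis=-1)  # (B, N)
    return float((per_patch * mask).sum() / mask.sum())

def global_pool(local_feats):
    """Mean-pool (B, N, D) local features to (B, D), then L2-normalize.

    Mean pooling is an assumption; the source only says "global pooling".
    """
    pooled = local_feats.mean(axis=1)
    return pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE loss: row i of z_a is the positive for row i of z_b,
    and every other row in the batch acts as a negative."""
    logits = z_a @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

# --- Toy usage with placeholder data (no real encoder or decoder) ---
rng = np.random.default_rng(0)
B, N, D = 4, 16, 8       # batch size, patches per frame, feature dim
LAMBDA = 0.025           # hypothetical weight on the contrastive term

feats_a = rng.normal(size=(B, N, D))                  # stand-in: frame / view 1
feats_b = feats_a + 0.05 * rng.normal(size=(B, N, D)) # stand-in: frame / view 2
mask = np.stack([random_mask(N, 0.75, rng) for _ in range(B)])

# Stand-in for a decoder's reconstruction of frame 1's patches.
recon = feats_a + 0.1 * rng.normal(size=(B, N, D))

loss_rec = masked_l2_loss(recon, feats_a, mask)
loss_con = info_nce(global_pool(feats_a), global_pool(feats_b))
total = loss_rec + LAMBDA * loss_con
print(f"reconstruction={loss_rec:.4f} contrastive={loss_con:.4f} total={total:.4f}")
```

In this sketch the reconstruction term only sees masked patches, while the contrastive term only sees the pooled global vector, which mirrors the split between local and global objectives that the caption describes. In a real model both terms would be computed from the output of the same shared-weight ViT encoder.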