Contrastive Masked Autoencoders are Stronger Vision Learners

Zhicheng Huang, Xiaojie Jin, Chengze Lu, Qibin Hou, Ming-Ming Cheng, Dongmei Fu, Xiaohui Shen, Jiashi Feng

TL;DR

CMAE advances self-supervised vision learning by unifying masked image modeling with contrastive learning through a two-branch architecture and two novel design components: a feature decoder to align masked features and a pixel-shifting view generator to create plausible positive pairs. The framework jointly optimizes a reconstruction loss and an InfoNCE-based contrastive loss, enabling simultaneous holistic and discriminative feature learning. Empirically, CMAE achieves state-of-the-art results across ImageNet-1K classification, ADE20K semantic segmentation, and COCO object detection, with CMAE-Base reaching 85.3% top-1 on ImageNet and 52.5% mIoU on ADE20K, and demonstrates strong transferability and scalability. The approach offers a practical path to more discriminative and generalizable vision representations, with open-source code available for reproduction.
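
The joint objective described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function names, the loss weight `weight`, the temperature of 0.07, and the choice to average the reconstruction error over masked patches only are all assumptions made for the example.

```python
import math

def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked patches only.

    pred, target: lists of patches, each patch a list of floats.
    mask: list of 0/1 flags, 1 = patch was hidden from the encoder.
    """
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m:
            total += sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)
            count += 1
    return total / max(count, 1)

def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(online, momentum, temperature=0.07):
    """InfoNCE over a batch: row i of `online` is the positive of row i of `momentum`.

    All other rows of `momentum` act as negatives. The temperature is an assumed value.
    """
    q = [l2_normalize(v) for v in online]
    k = [l2_normalize(v) for v in momentum]
    loss = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(q)

def cmae_style_loss(pred, target, mask, online, momentum, weight=1.0, temperature=0.07):
    """Reconstruction loss plus a weighted contrastive loss (weight is hypothetical)."""
    return masked_mse(pred, target, mask) + weight * info_nce(online, momentum, temperature)
```

In this sketch, `pred` and `target` stand in for the pixel decoder output and the original patches, while `online` and `momentum` stand in for the feature decoder output and the momentum encoder output after projection.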

Abstract

Masked image modeling (MIM) has achieved promising results on various vision tasks. However, the limited discriminability of the learned representations shows that there is still considerable room for building a stronger vision learner. Towards this goal, we propose Contrastive Masked Autoencoders (CMAE), a new self-supervised pre-training method for learning more comprehensive and capable vision representations. By carefully unifying contrastive learning (CL) and masked image modeling (MIM) through novel designs, CMAE leverages their respective advantages and learns representations with both strong instance discriminability and local perceptibility. Specifically, CMAE consists of two branches: the online branch is an asymmetric encoder-decoder, and the momentum branch is a momentum-updated encoder. During training, the online encoder reconstructs original images from latent representations of masked images to learn holistic features. The momentum encoder, fed with the full images, enhances feature discriminability via contrastive learning with its online counterpart. To make CL compatible with MIM, CMAE introduces two new components: pixel shifting for generating plausible positive views, and a feature decoder for complementing the features of contrastive pairs. Thanks to these novel designs, CMAE effectively improves representation quality and transfer performance over its MIM counterpart. CMAE achieves state-of-the-art performance on highly competitive benchmarks of image classification, semantic segmentation and object detection. Notably, CMAE-Base achieves $85.3\%$ top-1 accuracy on ImageNet and $52.5\%$ mIoU on ADE20K, surpassing the previous best results by $0.7\%$ and $1.8\%$, respectively. The source code is publicly accessible at \url{https://github.com/ZhichengHuang/CMAE}.
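
The momentum-updated encoder mentioned in the abstract can be illustrated with an exponential moving average (EMA) over parameters. The sketch below is a minimal example under stated assumptions: the dictionary-of-lists parameter layout, the function name `ema_update`, and the coefficient `mu=0.996` are hypothetical and are not taken from the paper.

```python
def ema_update(momentum_params, online_params, mu=0.996):
    """Return momentum parameters moved a small step toward the online ones.

    Both arguments map a parameter name to a flat list of floats.
    mu close to 1 means the momentum branch changes slowly (assumed value).
    """
    return {
        name: [
            mu * m + (1.0 - mu) * o
            for m, o in zip(momentum_params[name], online_params[name])
        ]
        for name in momentum_params
    }
```

In a training loop, a step like this would typically run after each optimizer update of the online branch, so the momentum branch receives no gradients and only tracks a smoothed copy of the online weights.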

Paper Structure

This paper contains 16 sections, 8 equations, 9 figures, and 6 tables.

Figures (9)

  • Figure 1: Overview of CMAE. CMAE improves over its MIM counterpart by leveraging contrastive learning through novel designs. To make contrastive learning compatible with MIM, we propose a feature decoder to complement the masked features and a weak spatial-shifting augmentation method for generating plausible contrastive views.
  • Figure 2: Comparisons with previous state-of-the-art MIM methods on ImageNet-1K in terms of top-1 accuracy at different pre-training epochs.
  • Figure 3: Overall pipeline. Our method contains three components: the online encoder, the momentum encoder, and the online decoder. Given a training image, the method applies pixel shifting to generate two different views, which are fed into the online and momentum encoders respectively. The online encoder randomly masks a fraction of the image patches and operates only on the visible ones. The momentum encoder operates on the whole view after pixel shifting. The pixel decoder learns to reconstruct the input image from the image tokens (along with MASK tokens) provided by the online encoder, while the feature decoder learns to predict the features of the input image for contrastive learning against the momentum encoder's output features. During pre-training, the parameters of the momentum encoder and projection head are updated using an exponential moving average. After pre-training, only the online encoder is kept for downstream applications.
  • Figure 4: Partial fine-tuning results using the ViT-B backbone.
  • Figure 5: Scaling results with different model sizes.
  • ...and 4 more figures
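
The view generation and masking steps named in the Figure 3 caption can be sketched as follows. This is an assumption-laden illustration rather than the released code: the function names, the rule of taking two crops whose corners differ by at most `max_shift` pixels, and the 0.75 mask ratio used in the usage lines are all hypothetical choices for the example.

```python
import random

def pixel_shift_views(image, crop, max_shift, rng):
    """Cut two crops of size `crop` whose top-left corners differ by a small shift.

    image: 2D list (rows of pixel values), at least crop + max_shift on each side.
    Returns (online_view, momentum_view, (dy, dx)).
    """
    height, width = len(image), len(image[0])
    if height < crop + max_shift or width < crop + max_shift:
        raise ValueError("image too small for the requested crop and shift")
    # Top-left corner of the online view, leaving room for the shifted view.
    y = rng.randint(0, height - crop - max_shift)
    x = rng.randint(0, width - crop - max_shift)
    # Small offset for the momentum view; the range [0, max_shift] is an assumption.
    dy = rng.randint(0, max_shift)
    dx = rng.randint(0, max_shift)
    online = [row[x:x + crop] for row in image[y:y + crop]]
    momentum = [row[x + dx:x + dx + crop] for row in image[y + dy:y + dy + crop]]
    return online, momentum, (dy, dx)

def random_patch_mask(num_patches, mask_ratio, rng):
    """Return 0/1 flags with round(num_patches * mask_ratio) patches masked (1)."""
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [1 if i in masked else 0 for i in range(num_patches)]

# Usage: a toy 10x10 "image", 6x6 crops, shifts of at most 2 pixels.
rng = random.Random(0)
image = [[r * 10 + c for c in range(10)] for r in range(10)]
online_view, momentum_view, shift = pixel_shift_views(image, crop=6, max_shift=2, rng=rng)
mask = random_patch_mask(num_patches=196, mask_ratio=0.75, rng=rng)
```

Because the shift is small relative to the crop size, the two views overlap heavily, which is one way to read the caption's notion of plausible contrastive views; only the online view would then be masked.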