MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers

Jihao Liu, Xin Huang, Jinliang Zheng, Yu Liu, Hongsheng Li

TL;DR

MixMAE tackles the inefficiency and pretraining–finetuning inconsistency of masked image modeling for hierarchical Vision Transformers by replacing masked tokens with visible tokens from a second image and performing dual reconstruction. The encoder-decoder framework uses a hierarchical Swin Transformer as the encoder with a lightweight decoder, enabling efficient pretraining while preserving multi-scale representations. Empirical results show strong ImageNet-1K performance (e.g., 85.1% top-1 accuracy with Swin-B/W14 after 600 pretraining epochs) and favorable transfer to COCO and ADE20K, with a better FLOPs/performance tradeoff than prior MIM methods. The approach applies across backbone scales and downstream tasks, offering a practical path for efficient large-scale pretraining of hierarchical vision models.

Abstract

In this paper, we propose Mixed and Masked AutoEncoder (MixMAE), a simple but efficient pretraining method that is applicable to various hierarchical Vision Transformers. Existing masked image modeling (MIM) methods for hierarchical Vision Transformers replace a random subset of input tokens with a special [MASK] symbol and aim at reconstructing the original image tokens from the corrupted image. However, we find that using the [MASK] symbol greatly slows down training and causes pretraining-finetuning inconsistency, due to the large masking ratio (e.g., 60% in SimMIM). On the other hand, MAE does not introduce [MASK] tokens at its encoder at all but is not applicable to hierarchical Vision Transformers. To solve the issue and accelerate the pretraining of hierarchical models, we replace the masked tokens of one image with visible tokens of another image, i.e., creating a mixed image. We then conduct dual reconstruction to reconstruct the two original images from the mixed input, which significantly improves efficiency. While MixMAE can be applied to various hierarchical Transformers, this paper explores using Swin Transformer with a large window size and scales up to a huge model size (reaching 600M parameters). Empirical results demonstrate that MixMAE can learn high-quality visual representations efficiently. Notably, MixMAE with Swin-B/W14 achieves 85.1% top-1 accuracy on ImageNet-1K by pretraining for 600 epochs. Besides, its transfer performance on 6 other datasets shows that MixMAE has a better FLOPs/performance tradeoff than previous popular MIM methods. Code is available at https://github.com/Sense-X/MixMIM.
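
The mixing step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the linked repository for that): the function name `mix_tokens`, the flat `(num_tokens, dim)` token layout, and the default `mask_ratio=0.5` are all assumptions made for illustration.

```python
import numpy as np

def mix_tokens(tokens_a, tokens_b, mask_ratio=0.5, rng=None):
    """Build a mixed token sequence from two images (illustrative sketch).

    tokens_a, tokens_b: arrays of shape (num_tokens, dim).
    mask_ratio: fraction of positions taken from image B (assumed value).
    Returns (mixed, mask); mask[i] is True where position i comes from
    image B, i.e. where image A is "masked".
    """
    rng = np.random.default_rng() if rng is None else rng
    num_tokens = tokens_a.shape[0]
    num_from_b = int(round(num_tokens * mask_ratio))

    # Pick a random subset of positions to fill with tokens of image B.
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.permutation(num_tokens)[:num_from_b]] = True

    # No [MASK] symbol enters the encoder: every position holds a real
    # token, from image B where mask is True and from image A elsewhere.
    mixed = np.where(mask[:, None], tokens_b, tokens_a)
    return mixed, mask
```

Because every position of the mixed input carries visible content from one of the two images, the encoder input contains no placeholder symbols, which is the property the abstract credits for the efficiency gain.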

Paper Structure (20 sections, 2 equations, 5 figures, 19 tables)

Figures (5)

  • Figure 1: Overview of MixMAE. For pretraining, two images are mixed with a random mixing mask to create a mixed image. MixMAE takes the mixed image as input and reconstructs the two original images. Right before decoding, the token embeddings are unmixed and filled with mask tokens for dual reconstruction of the two original images.
  • Figure 2: Tradeoffs of FLOPs vs. (left) top-1 accuracy on ImageNet-1K, (middle) $\text{AP}^{\text{box}}$ on COCO, and (right) mIoU on ADE20K. All results are from various self-supervised pretraining methods followed by supervised finetuning. All COCO entries use the Mask R-CNN framework. All ADE20K entries use the UperNet framework. Note that this comparison confounds differences in architecture and pretraining strategy.
  • Figure 3: Efficiency comparison between MixMAE and SimMIM. We report the finetuning accuracy on ImageNet-1K. The encoder is Swin-B/W14 with input size of $224\times224$.
  • Figure 4: Example images for different filling contents.
  • Figure 5: Mixed convolution.
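
The unmixing step in the Figure 1 caption can be sketched in NumPy as follows. This is a hedged illustration, not the authors' code: the names `unmix_tokens` and `dual_reconstruction_loss` are hypothetical, the mask convention (True means the position came from the second image) is an assumption, and computing a mean-squared error only on the positions each image did not contribute is an assumed choice that the text above does not specify.

```python
import numpy as np

def unmix_tokens(encoded, mask, mask_token):
    """Split mixed embeddings into one sequence per source image.

    encoded: (num_tokens, dim) encoder output for the mixed image.
    mask: (num_tokens,) bool, True where the token came from image B.
    mask_token: (dim,) vector used to fill the missing positions.
    """
    m = mask[:, None]
    seq_a = np.where(m, mask_token, encoded)  # keep A's tokens, fill B's slots
    seq_b = np.where(m, encoded, mask_token)  # keep B's tokens, fill A's slots
    return seq_a, seq_b

def dual_reconstruction_loss(pred_a, pred_b, target_a, target_b, mask):
    """Sum of two per-image MSE terms (assumed loss form).

    Each term averages over the positions that were hidden for that
    image: mask for image A, ~mask for image B. Both sets are assumed
    to be non-empty.
    """
    err_a = ((pred_a - target_a) ** 2).mean(axis=-1)
    err_b = ((pred_b - target_b) ** 2).mean(axis=-1)
    return err_a[mask].mean() + err_b[~mask].mean()
```

In this sketch the decoder would be run once on each unmixed sequence, so a single encoder pass over the mixed input yields reconstruction targets for both original images.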