Improving Masked Autoencoders by Learning Where to Mask

Haijian Chen, Wendong Zhang, Yunbo Wang, Xiaokang Yang

TL;DR

AutoMAE tackles the masking strategy in masked image modeling by learning where to mask. It introduces a differentiable mask generator linked to the MAE via Gumbel-Softmax and trained adversarially to focus on informative foreground patches, balancing information gain with training difficulty. The approach yields strong linear probing, robust finetuning, and notable transfer to downstream tasks, especially with limited data. This end-to-end framework advances self-supervised pretraining by incorporating object-centric priors into the masking process, with practical impact across vision benchmarks.

Abstract

Masked image modeling is a promising self-supervised learning method for visual data. It is typically built upon image patches with random masks, which largely ignores the variation of information density between them. The question is: Is there a better masking strategy than random sampling and how can we learn it? We empirically study this problem and initially find that introducing object-centric priors in mask sampling can significantly improve the learned representations. Inspired by this observation, we present AutoMAE, a fully differentiable framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process. In this way, our approach can adaptively find patches with higher information density for different images, and further strike a balance between the information gain obtained from image reconstruction and its practical training difficulty. In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
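
To make the role of Gumbel-Softmax concrete, the sketch below samples a patch mask from per-patch scores. It is a minimal illustration and not the authors' implementation: the function name `gumbel_softmax_mask`, the parameters `mask_ratio` and `tau`, and the use of a top-K selection to obtain the hard mask are all assumptions made for this example. NumPy is used for clarity, so no gradients flow here; in an autodiff framework the soft output would typically carry the gradient back to the mask generator.

```python
import numpy as np

def gumbel_softmax_mask(logits, mask_ratio=0.75, tau=1.0, rng=None):
    """Sample a patch mask from 1-D per-patch logits (hypothetical helper).

    Returns (soft, hard): soft is a relaxed distribution over patches,
    hard is a 0/1 vector in which 1 marks a masked patch.
    """
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=np.float64)
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1).
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    # Temperature-scaled, numerically stable softmax over perturbed logits.
    z = (logits + gumbel) / tau
    z = z - z.max()
    soft = np.exp(z) / np.exp(z).sum()
    # Hard mask (assumption): mask the K patches with the largest scores.
    k = int(round(mask_ratio * logits.size))
    hard = np.zeros(logits.size, dtype=np.int64)
    hard[np.argsort(z)[logits.size - k:]] = 1
    return soft, hard
```

For a 14×14 patch grid, `gumbel_softmax_mask(np.zeros(196))` masks 147 patches chosen uniformly at random, while larger logits on some patches make those patches more likely to be masked.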

Paper Structure (36 sections, 4 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: Comparison of masking strategies. (a) The original MAE (He et al., 2022) randomly masks $75\%$ of image patches with a uniform probability. (b) SemMAE (Li et al., 2022) uses a manually designed easy-to-hard masking schedule guided by an independently trained semantic part indicator. (c) AutoMAE is a fully differentiable framework that uses an adversarially trained mask generator.
  • Figure 2: Effects of raising the masking probability by $\beta$ on patches within the object bounding boxes. Models are trained on a subset ($10\%$) of the ImageNet dataset. The red dashed line represents the results of using the original random masking strategy.
  • Figure 3: The end-to-end framework of AutoMAE, which is designed to tackle the patch selection dilemma in a fully differentiable manner.
  • Figure 4: High-weighted masks on ImageNet-9 produced by the mask generator. The highlighted areas are obtained by selecting the $K$ largest values of the mask before random noise is added.
  • Figure S1: The architecture of the mask discriminator.
  • ...and 3 more figures
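
The object-centric prior studied in Figure 2 can be illustrated with a short sketch. The exact way $\beta$ enters the sampling is not stated in this summary, so the version below is an assumption: each patch starts with weight 1, patches inside the bounding box get weight $1 + \beta$, and the masked patches are drawn without replacement in proportion to these weights. The function name `bbox_biased_mask` and its parameters are hypothetical.

```python
import numpy as np

def bbox_biased_mask(grid, bbox, beta, mask_ratio=0.75, rng=None):
    """Sample a (grid x grid) 0/1 patch mask biased toward a bounding box.

    bbox is (row0, col0, row1, col1) in patch units, end-exclusive.
    beta >= 0 is the extra sampling weight for patches inside the box
    (an assumed reading of "raising the masking probability by beta").
    """
    rng = np.random.default_rng() if rng is None else rng
    weights = np.ones((grid, grid), dtype=np.float64)
    r0, c0, r1, c1 = bbox
    weights[r0:r1, c0:c1] += beta
    probs = (weights / weights.sum()).ravel()
    # Draw K distinct patches to mask, proportionally to their weights.
    k = int(round(mask_ratio * grid * grid))
    masked = rng.choice(grid * grid, size=k, replace=False, p=probs)
    mask = np.zeros(grid * grid, dtype=np.int64)
    mask[masked] = 1
    return mask.reshape(grid, grid)
```

With `beta=0` this reduces to uniform random masking, which corresponds to the baseline shown as the red dashed line in Figure 2.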