Improving Masked Autoencoders by Learning Where to Mask
Haijian Chen, Wendong Zhang, Yunbo Wang, Xiaokang Yang
TL;DR
AutoMAE replaces random masking in masked image modeling with a learned masking strategy. A differentiable mask generator, trained adversarially and linked to the MAE through Gumbel-Softmax, concentrates masks on informative foreground patches while balancing information gain against training difficulty. The approach yields strong linear-probing results, robust finetuning, and notable transfer to downstream tasks, especially with limited data. By building object-centric priors into the masking process, this end-to-end framework improves self-supervised pretraining across vision benchmarks.
Abstract
Masked image modeling is a promising self-supervised learning method for visual data. It typically applies random masks to image patches, which largely ignores how information density varies between patches. The question is: Is there a better masking strategy than random sampling, and how can we learn it? We study this problem empirically and first find that introducing object-centric priors in mask sampling can significantly improve the learned representations. Inspired by this observation, we present AutoMAE, a fully differentiable framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process. In this way, our approach can adaptively find patches with higher information density for different images, and further strike a balance between the information gain obtained from image reconstruction and its practical training difficulty. In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
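
To make the role of Gumbel-Softmax concrete, the following minimal sketch shows how per-patch scores from a mask generator could be turned into a sampled mask. It is an illustration only, not the authors' implementation: the function names `gumbel_softmax` and `sample_mask`, the 0.75 mask ratio, the temperature `tau`, and the top-k selection step are all assumptions made for this example. It uses plain Python, so it covers only the forward sampling step. In an autograd framework, the relaxed weights would be what lets gradients reach the generator.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return relaxed one-hot weights: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise via inverse transform: g = -log(-log(u)).
        # Clamp u so it stays strictly inside (0, 1) and both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed scores.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def sample_mask(patch_logits, mask_ratio=0.75, tau=1.0, rng=random):
    """Sample which patches to mask, favouring patches with higher scores.

    Returns (hard_mask, soft_weights), where hard_mask[i] == 1 means patch i
    is masked. The top-k step is an assumption of this sketch.
    """
    soft = gumbel_softmax(patch_logits, tau, rng)
    num_masked = round(mask_ratio * len(patch_logits))
    # Taking the top-k Gumbel-perturbed scores samples k patches
    # without replacement, biased toward high logits.
    ranked = sorted(range(len(soft)), key=lambda i: soft[i], reverse=True)
    chosen = set(ranked[:num_masked])
    hard = [1 if i in chosen else 0 for i in range(len(soft))]
    return hard, soft


if __name__ == "__main__":
    # Hypothetical scores for a 4x4 patch grid: the last 8 patches score higher.
    scores = [0.0] * 8 + [5.0] * 8
    mask, weights = sample_mask(scores, rng=random.Random(0))
    print(mask)
```

Lowering `tau` pushes the relaxed weights toward a one-hot vector, while raising it makes them more uniform; random masking corresponds to the special case where all patch scores are equal.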
