Attention-Guided Masked Autoencoders For Learning Image Representations
Leon Sick, Dominik Engel, Pedro Hermosilla, Timo Ropinski
TL;DR
The paper addresses unsupervised pre-training of vision transformers with masked autoencoders (MAEs) and a limitation of that approach: treating all image patches equally may overlook object-centric structure. It introduces the Attention-Guided Masked Autoencoder (AttG), which uses an attention map from unsupervised object discovery to weight the per-patch reconstruction loss, focusing learning on relevant objects while preserving the MAE masking strategy. The per-patch weighting is defined by $M_{scaled} = \exp(M_{norm} / \tau)$ and $L_{AttG} = \gamma \cdot L_{Rec} \cdot M_{scaled}$, with a half-cycle cosine schedule for the temperature $\tau$. The paper reports gains in linear probing, k-NN classification, few-shot transfer, and robustness on ImageNet and transfer datasets, with minimal computational overhead. The findings suggest a practical route to richer, off-the-shelf representations from pre-training, and point toward future extensions to multi-object semantic guidance and more advanced object-discovery paradigms.
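The weighting above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`cosine_tau`, `attg_loss`), the schedule endpoints `tau_start` and `tau_end`, the mean reduction over patches, and the optional `mask` restricting the loss to masked patches (as in standard MAE training) are all assumptions made for illustration; only the two formulas and the use of a half-cycle cosine schedule come from the summary.

```python
import math
import numpy as np

def cosine_tau(step, total_steps, tau_start, tau_end):
    """Half-cycle cosine interpolation from tau_start (step 0) to tau_end (final step).

    The endpoints and the direction of the schedule are assumptions of this sketch.
    """
    progress = step / max(1, total_steps)
    return tau_end + 0.5 * (tau_start - tau_end) * (1.0 + math.cos(math.pi * progress))

def attg_loss(rec_loss, m_norm, tau, gamma=1.0, mask=None):
    """Attention-weighted reconstruction loss, reduced to a scalar.

    rec_loss: per-patch reconstruction loss L_Rec, shape (num_patches,)
    m_norm:   per-patch normalized attention map M_norm, same shape
    tau:      temperature; smaller values sharpen the weighting
    gamma:    global scaling factor
    mask:     optional 0/1 array selecting the patches that contribute
              (assumed here to be the masked patches, as in a standard MAE)
    """
    rec_loss = np.asarray(rec_loss, dtype=float)
    m_scaled = np.exp(np.asarray(m_norm, dtype=float) / tau)  # M_scaled = exp(M_norm / tau)
    per_patch = gamma * rec_loss * m_scaled                   # L_AttG per patch
    if mask is None:
        return per_patch.mean()
    mask = np.asarray(mask, dtype=float)
    return (per_patch * mask).sum() / mask.sum()              # assumes at least one selected patch

# Example: the second patch has higher attention and therefore a larger weight.
tau = cosine_tau(step=0, total_steps=100, tau_start=1.0, tau_end=0.5)
loss = attg_loss(rec_loss=[1.0, 1.0], m_norm=[0.0, 1.0], tau=tau)
```

With a uniform attention map of zeros, every weight equals 1 and the expression reduces to $\gamma$ times the plain mean reconstruction loss, which is one way to see that the weighting leaves the masking strategy itself untouched.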
Abstract
Masked autoencoders (MAEs) have established themselves as a powerful method for unsupervised pre-training for computer vision tasks. While vanilla MAEs put equal emphasis on reconstructing the individual parts of the image, we propose to inform the reconstruction process through an attention-guided loss function. By leveraging advances in unsupervised object discovery, we obtain an attention map of the scene which we employ in the loss function to put increased emphasis on reconstructing relevant objects, thus effectively incentivizing the model to learn more object-focused representations without compromising the established masking strategy. Our evaluations show that our pre-trained models learn better latent representations than the vanilla MAE, demonstrated by improved linear probing and k-NN classification results on several benchmarks while at the same time making ViTs more robust against varying backgrounds.
