Attention-Guided Masked Autoencoders For Learning Image Representations

Leon Sick; Dominik Engel; Pedro Hermosilla; Timo Ropinski

TL;DR

The paper addresses unsupervised pre-training of vision transformers with masked autoencoders (MAEs), which weight all image patches equally during reconstruction and may therefore overlook object-centric structure. It introduces the Attention-Guided Masked Autoencoder (AttG), which uses an attention map from an unsupervised object-discovery network to weight the per-patch reconstruction loss, focusing learning on relevant objects while preserving the MAE masking strategy. The per-patch weighting is formalized as $M_{scaled} = \exp(M_{norm} / \tau)$ and $L_{AttG} = \gamma \cdot L_{Rec} \cdot M_{scaled}$, with a half-cycle cosine schedule for $\tau$. The paper reports gains in linear probing, k-NN classification, few-shot transfer, and robustness to background changes on ImageNet and transfer datasets, with minimal computational overhead. The findings suggest a practical route to richer off-the-shelf representations from pre-training and point toward future extensions to multi-object semantic guidance and more advanced object-discovery paradigms.
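The weighting described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. Several details are assumptions made only for illustration: min-max normalization as the way to obtain $M_{norm}$, averaging the weighted loss over masked patches only, the direction and endpoints of the $\tau$ schedule, and all function and parameter names (`cosine_tau`, `min_max_normalize`, `attg_loss`, `tau_start`, `tau_end`).

```python
import math


def cosine_tau(step, total_steps, tau_start, tau_end):
    """Half-cycle cosine interpolation of tau.

    Assumption: tau moves from tau_start (step 0) to tau_end (last step);
    the summary only states that a half-cycle cosine schedule is used.
    """
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return tau_end + 0.5 * (tau_start - tau_end) * (1.0 + math.cos(math.pi * progress))


def min_max_normalize(attention):
    """Map per-patch attention values to [0, 1] (assumed form of M_norm)."""
    lo, hi = min(attention), max(attention)
    if hi == lo:
        # A constant map carries no object information; treat all patches alike.
        return [0.0 for _ in attention]
    return [(a - lo) / (hi - lo) for a in attention]


def attg_loss(rec_loss, attention, masked, tau, gamma=1.0):
    """Mean over masked patches of gamma * L_rec * exp(M_norm / tau).

    rec_loss:  per-patch reconstruction losses
    attention: per-patch attention values from an object-discovery model
    masked:    per-patch booleans, True where the patch was masked in the input
    """
    m_norm = min_max_normalize(attention)
    m_scaled = [math.exp(m / tau) for m in m_norm]
    weighted = [
        gamma * loss * weight
        for loss, weight, is_masked in zip(rec_loss, m_scaled, masked)
        if is_masked
    ]
    return sum(weighted) / len(weighted) if weighted else 0.0
```

Under these assumptions, a patch with $M_{norm} = 0$ receives weight $\exp(0) = 1$, so background patches still contribute to the loss, while patches with higher attention are weighted more strongly. A smaller $\tau$ sharpens that contrast.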

Abstract

Masked autoencoders (MAEs) have established themselves as a powerful method for unsupervised pre-training for computer vision tasks. While vanilla MAEs put equal emphasis on reconstructing the individual parts of the image, we propose to inform the reconstruction process through an attention-guided loss function. By leveraging advances in unsupervised object discovery, we obtain an attention map of the scene which we employ in the loss function to put increased emphasis on reconstructing relevant objects, thus effectively incentivizing the model to learn more object-focused representations without compromising the established masking strategy. Our evaluations show that our pre-trained models learn better latent representations than the vanilla MAE, demonstrated by improved linear probing and k-NN classification results on several benchmarks while at the same time making ViTs more robust against varying backgrounds.

Paper Structure (20 sections, 3 equations, 3 figures, 11 tables, 1 algorithm)

Introduction
Related Work
Method
Learning Architecture
Attention-Guided Loss
Object Discovery
Experiments
ImageNet Evaluation
Comparison to Attention-Masking Add-On Methods
Robustness to Background Changes
Transfer to Other Datasets
Few-Shot Transfer
Image Retrieval
Ablations
Attention Map Quality
...and 5 more sections

Figures (3)

Figure 1: Attention-Guided Reconstruction Loss. We leverage an attention map, obtained from a self-supervised object discovery network, to inform our guided reconstruction loss AttG towards reconstructing the relevant objects in the scene. Through this process, we guide the MAE to learn more effective off-the-shelf representations.
Figure 2: AttG Architecture Overview And Scaling Function. Our architecture, displayed in the main subfigure, employs two streams. Besides the main $MAE$ backbone, we employ an unsupervised object discovery ($OD$) stream with fixed weights to obtain a patch-based attention map. This attention map is then scaled and finally used to inform our attention-guided reconstruction loss AttG towards relevant objects. Gray overlays depict image patches which are not part of the loss since they have not been masked in the input image. To make the attention map useful for reconstruction guidance, we scale its values, as visualized in the scaling subfigure. As a result, the loss for background patches is not masked out, while patches from the relevant object in the scene are further emphasized.
Figure 3: Visual Comparison Of Our Attention Maps. We illustrate the differences between our attention maps. Each has been normalized and scaled in the same manner as for pre-training our models. In these examples, TokenCut produces the best foreground-background separation of all presented methods. We study the effect of the different maps on our guidance in the Attention Map Quality section.
