Attention-Guided Masked Autoencoders For Learning Image Representations

Leon Sick, Dominik Engel, Pedro Hermosilla, Timo Ropinski

TL;DR

The paper addresses unsupervised pre-training of vision transformers using masked autoencoders (MAEs) and the limitation that equal treatment of all image patches under MAEs may overlook object-centric structure. It introduces Attention-Guided Masked Autoencoder (AttG), which uses an unsupervised object-discovery attention map to weight the per-patch reconstruction loss, focusing learning on relevant objects while preserving the MAE masking strategy. The method formalizes a per-patch loss weighting with $M_{scaled} = \exp(M_{norm} / \tau)$ and $L_{AttG} = \gamma \cdot L_{Rec} \cdot M_{scaled}$, with a half-cycle cosine schedule for $\tau$, and demonstrates gains in linear probing, k-NN, few-shot transfer, and robustness across ImageNet and transfer datasets, with minimal computational overhead. The findings suggest a practical route to richer, off-the-shelf representations from pre-training, and point toward future extensions to multi-object semantic guidance and more advanced object-discovery paradigms.
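The two formulas above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the assumption that $M_{norm}$ lies in $[0, 1]$, and the start and end values and direction of the $\tau$ schedule are all hypothetical choices made for the example.

```python
import math

import numpy as np


def half_cycle_cosine(step, total_steps, tau_start, tau_end):
    """Move tau from tau_start to tau_end along half a cosine period.

    The endpoints and the direction of the schedule are assumptions.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return tau_end + 0.5 * (tau_start - tau_end) * (1.0 + math.cos(math.pi * progress))


def scale_attention(m_norm, tau):
    """M_scaled = exp(M_norm / tau), applied element-wise per patch."""
    return np.exp(np.asarray(m_norm, dtype=float) / tau)


def attg_loss(rec_loss, m_norm, tau, gamma=1.0):
    """L_AttG = gamma * L_Rec * M_scaled, kept per patch (no reduction)."""
    return gamma * np.asarray(rec_loss, dtype=float) * scale_attention(m_norm, tau)


if __name__ == "__main__":
    rec = np.array([2.0, 2.0])      # equal reconstruction error on two patches
    m_norm = np.array([0.0, 1.0])   # background patch vs. object patch
    tau = half_cycle_cosine(step=0, total_steps=100, tau_start=1.0, tau_end=0.5)
    print(attg_loss(rec, m_norm, tau))
```

Because $\exp(0) = 1$, a patch with zero attention keeps its unweighted reconstruction loss, while patches with higher attention are weighted more strongly; a smaller $\tau$ sharpens that contrast.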

Abstract

Masked autoencoders (MAEs) have established themselves as a powerful method for unsupervised pre-training for computer vision tasks. While vanilla MAEs put equal emphasis on reconstructing the individual parts of the image, we propose to inform the reconstruction process through an attention-guided loss function. By leveraging advances in unsupervised object discovery, we obtain an attention map of the scene which we employ in the loss function to put increased emphasis on reconstructing relevant objects, thus effectively incentivizing the model to learn more object-focused representations without compromising the established masking strategy. Our evaluations show that our pre-trained models learn better latent representations than the vanilla MAE, demonstrated by improved linear probing and k-NN classification results on several benchmarks while at the same time making ViTs more robust against varying backgrounds.

Paper Structure (20 sections, 3 equations, 3 figures, 11 tables, 1 algorithm)

This paper contains 20 sections, 3 equations, 3 figures, 11 tables, 1 algorithm.

Figures (3)

  • Figure 1: Attention-Guided Reconstruction Loss. We leverage an attention map, obtained from a self-supervised object discovery network, to steer our guided reconstruction loss AttG towards reconstructing the relevant objects in the scene. Through this process, we guide the MAE to learn more effective off-the-shelf representations.
  • Figure 2: AttG Architecture Overview And Scaling Function. Our architecture, displayed in the main subfigure, employs two streams. Besides the main MAE backbone, we employ an unsupervised object discovery (OD) stream with fixed weights to obtain a patch-based attention map. This attention map is then scaled and used to steer our attention-guided reconstruction loss AttG towards relevant objects. Gray overlays depict image patches that are not part of the loss because they were not masked in the input image. To make the attention map useful for reconstruction guidance, we scale its values, as visualized in the scaling subfigure. As a result, the loss for background patches is not masked out, while patches from the relevant object in the scene are further emphasized.
  • Figure 3: Visual Comparison Of Our Attention Maps. We illustrate the differences between our attention maps. Each has been normalized and scaled in the same way as for pre-training our models. In these examples, TokenCut appears to produce the best foreground-background separation of all presented methods. We study the effect of the different maps on our guidance in the section on mask quality.
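
The Figure 2 caption notes that only masked patches contribute to the loss and that background patches are down-weighted rather than removed. A minimal NumPy sketch of that combination follows. It assumes a per-patch mean-squared-error reconstruction loss averaged over masked patches, and an attention map already normalized to $[0, 1]$ by the frozen object-discovery stream; the function name, argument names, and array shapes are hypothetical.

```python
import numpy as np


def masked_attg_loss(pred, target, mask, m_norm, tau=1.0, gamma=1.0):
    """Attention-weighted reconstruction loss over masked patches only.

    pred, target: (N, L, D) predicted and true patch pixels.
    mask:         (N, L), 1 for patches that were masked in the input, else 0.
    m_norm:       (N, L), normalized attention per patch (assumed in [0, 1]).
    """
    # Per-patch reconstruction error (assumed to be MSE over patch pixels).
    rec = ((pred - target) ** 2).mean(axis=-1)
    # Scale the attention map; exp(0) = 1 keeps background patches in the loss.
    weighted = gamma * rec * np.exp(m_norm / tau)
    # Visible (unmasked) patches are excluded, matching the gray overlays.
    return (weighted * mask).sum() / mask.sum()


if __name__ == "__main__":
    pred = np.zeros((1, 4, 3))
    target = np.ones((1, 4, 3))
    mask = np.array([[1.0, 1.0, 0.0, 0.0]])
    m_norm = np.array([[1.0, 0.0, 0.0, 0.0]])
    print(masked_attg_loss(pred, target, mask, m_norm))
```

In this sketch the attention map only changes how much each masked patch counts; which patches are masked is decided independently, so the masking strategy itself is untouched.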