Beyond [cls]: Exploring the true potential of Masked Image Modeling representations

Marcin Przewięźlikowski, Randall Balestriero, Wojciech Jasiński, Marek Śmieja, Bartosz Zieliński

TL;DR

The paper investigates why Masked Image Modeling (MIM) yields weaker out-of-the-box high-level perception performance than Joint Embedding Architectures (JEAs). By analyzing attention flow in Vision Transformers, it shows that in MAE-style MIM the [cls] token attends mainly to itself and spreads its remaining attention almost uniformly across patches, which limits the extraction of salient global information. To address this, the authors introduce Selective Aggregation, a lightweight weighting of patch tokens based on Attention-based Multiple Instance Learning Pooling (AbMILP), which yields consistently stronger global representations across a range of MIM backbones without retraining the backbone. The approach significantly narrows the performance gap on ImageNet-1k and improves low-shot and fine-grained tasks, suggesting that proper aggregation of patch information is a critical factor for MIM's practical effectiveness. The result is a lightweight, model-agnostic tool for enhancing MIM representations, along with guidance for future SSL methods toward more selective information integration in vision transformers.

Abstract

Masked Image Modeling (MIM) has emerged as a promising approach for Self-Supervised Learning (SSL) of visual representations. However, the out-of-the-box performance of MIMs is typically inferior to competing approaches. Most users cannot afford fine-tuning due to the need for large amounts of data, high GPU consumption, and specialized user knowledge. Therefore, the practical use of MIM representations is limited. In this paper we ask what is the reason for the poor out-of-the-box performance of MIMs. Is it due to weaker features produced by MIM models, or is it due to suboptimal usage? Through detailed analysis, we show that attention in MIMs is spread almost uniformly over many patches, leading to ineffective aggregation by the [cls] token. Based on this insight, we propose Selective Aggregation to better capture the rich semantic information retained in patch tokens, which significantly improves the out-of-the-box performance of MIM.
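
The idea of weighting patch tokens by relevance, rather than relying on the [cls] token or a plain average, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the single linear scoring vector `w`, and the toy tokens are all assumptions made for illustration, and the paper's actual AbMILP formulation may differ. In practice `w` would be learned on top of a frozen backbone; here it is set by hand.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def selective_aggregation(patch_tokens, w, b=0.0):
    """Pool patch tokens with relevance weights from a linear scorer.

    patch_tokens: list of N token vectors (each a list of D floats).
    w, b: hypothetical scorer parameters (assumed learned elsewhere).
    Returns the pooled D-dimensional vector and the N weights.
    """
    # One scalar relevance score per patch token.
    scores = [sum(wi * xi for wi, xi in zip(w, tok)) + b for tok in patch_tokens]
    # Normalize scores into weights that sum to 1 over the patches.
    weights = softmax(scores)
    dim = len(patch_tokens[0])
    pooled = [sum(a * tok[d] for a, tok in zip(weights, patch_tokens))
              for d in range(dim)]
    return pooled, weights

def average_pooling(patch_tokens):
    """Baseline: naive average over patch tokens."""
    n = len(patch_tokens)
    dim = len(patch_tokens[0])
    return [sum(tok[d] for tok in patch_tokens) / n for d in range(dim)]

# Toy example: three 2-dimensional patch tokens (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = selective_aggregation(tokens, w=[5.0, 5.0])
```

With an all-zero `w`, every patch receives the same weight and the result coincides with average pooling; a non-trivial `w` shifts the weight toward the patches the scorer rates as relevant.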

Paper Structure

This paper contains 51 sections, 6 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: The standard approaches used to obtain global representations in Masked Image Modeling (MIM) -- [cls] token or naive averaging over patch tokens -- do not focus on the most relevant image fragments, resulting in poor out-of-the-box performance. As a remedy, we propose Selective Aggregation -- a lightweight approach that dynamically selects relevant tokens, thereby improving performance.
  • Figure 2: ViTs trained with Joint-Embedding Architectures (JEA) attend to semantically rich patches while forming global [cls] representations, which is critical for perception performance. At the same time, ViTs trained with Masked Image Modeling (MIM) attend more uniformly to all patches, absorbing both relevant and irrelevant information and achieving an effect similar to naive average pooling (see left and center). To improve out-of-the-box MIM performance, we propose Selective Aggregation (see right) -- a mechanism that aggregates patch tokens according to their relevance, as quantified by a lightweight linear regressor.
  • Figure 3: Attention of the [cls] token to itself is much higher in MAE than in the JEA ViTs. As opposed to JEA, where the [cls] token gathers a large amount of information from the patch tokens, the MAE [cls] token primarily recycles its own representation.
  • Figure 4: Entropy of [cls] token attention to patch tokens reaches almost the maximal possible level in MAE. In other models, it decreases in the deeper model blocks, indicating that the [cls] token attends to different patches in a more selective manner. Fine-tuning of MAE decreases this entropy, indicating that selective attention to patch tokens is crucial for good perception.
  • Figure 5: Attention of the patch tokens to themselves, relative to the total attention given to all patch tokens. In the later MAE blocks, patch tokens seem to allocate more relative attention to themselves, compared to JEA.
  • ...and 8 more figures
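
The two quantities discussed in the captions of Figures 3 and 4, [cls] self-attention and the entropy of [cls] attention over patches, can be computed from a single attention row roughly as follows. This is a sketch under stated assumptions, not the paper's exact measurement: it assumes the [cls] token sits at index 0 of the row, and it renormalizes the patch attention to sum to 1 before taking the entropy. The function name and the example rows are hypothetical.

```python
import math

def cls_attention_stats(cls_row):
    """Summarize one attention row of the [cls] token.

    cls_row: attention weights from [cls] to all tokens, with the
             [cls] token itself assumed to be at index 0.
    Returns (self_attention, patch_entropy, max_entropy).
    """
    self_attention = cls_row[0]
    patch_attention = cls_row[1:]
    total = sum(patch_attention)
    if total <= 0:
        raise ValueError("[cls] gives no attention to patch tokens")
    # Renormalize so the patch attention forms a distribution (assumption).
    probs = [p / total for p in patch_attention]
    # Shannon entropy in nats; zero-probability patches contribute nothing.
    patch_entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Entropy is maximal when attention is uniform over all patches.
    max_entropy = math.log(len(patch_attention))
    return self_attention, patch_entropy, max_entropy

# Made-up rows: one uniform over patches, one focused on a single patch.
uniform_row = [0.6, 0.1, 0.1, 0.1, 0.1]
selective_row = [0.2, 0.8, 0.0, 0.0, 0.0]
uniform_stats = cls_attention_stats(uniform_row)
selective_stats = cls_attention_stats(selective_row)
```

In this toy setup the uniform row reaches the maximal entropy while the focused row has zero entropy, which mirrors the contrast the captions draw between near-uniform and selective attention.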