Suppressing Non-Semantic Noise in Masked Image Modeling Representations

Martine Hjelkrem-Tan, Marius Aasan, Rwiddhi Chakraborty, Gabriel Y. Arteaga, Changkyu Choi, Adín Ramírez Rivera

Abstract

Masked Image Modeling (MIM) has become a ubiquitous self-supervised vision paradigm. In this work, we show that MIM objectives cause the learned representations to retain non-semantic information, which ultimately hurts performance during inference. We introduce a model-agnostic score for semantic invariance using Principal Component Analysis (PCA) on real and synthetic non-semantic images. Based on this score, we propose a simple method, Semantically Orthogonal Artifact Projection (SOAP), to directly suppress non-semantic information in patch representations, leading to consistent improvements in zero-shot performance across various MIM-based models. SOAP is a post-hoc suppression method that requires zero training and can be attached to any model as a single linear head.

Figures (15)

  • Figure 1: Pipeline overview; a pretrained MIM encoder outputs dense representations $z$ which are used for downstream tasks---we show salient segmentation as an example. By identifying and suppressing principal components encoding positional noise, our SOAP module improves the representations $\hat{z}$ and enhances downstream performance in zero-shot settings.
  • Figure 2: Representations from MIM models exhibit strong positional bias in leading principal components (PC), illustrated here by the response heatmap of selected PCs for one example image. MIM models show clear left/right and top/bottom bias. This behavior is not observed for non-MIM models. In Section \ref{sec:method}, we propose a novel method to automatically isolate such positional bias in a post-hoc fashion. (A sketch of how such PC response heatmaps can be computed appears after this list.)
  • Figure 3: Plot of the semantic invariance (SI) score (viridis). The SI score increases when $P\approx Q$ and the probabilities are confident (close to $0$ or $1$). In the uncertain case $P\approx0.5\approx Q$, the score is lower to reflect the ambiguity of semantic invariance. For comparison, we also show the Dice-Sørensen coefficient (gray), which does not have this property and is thus unable to capture this uncertainty.
  • Figure 4: Semantic Invariance (SI) scores reveal positional bias in top-ranked principal components. Real and Synth columns show activations for real and synthetic images, respectively. The top two components encode left/right and top/bottom biases (see \ref{fig:tblr_response}), which diminish in lower-ranked components. MIM models exhibit clear semantic invariance, whereas the non-MIM model (DINO) does not. See \ref{fig:activations_bigfig} for additional models and examples.
  • Figure 5: Semantic invariance ($\text{SI}$) score in descending order. All scores are shown in the left plot, while the right plot focuses on the top 10 semantically invariant scores. Note that all MIM models have a maximum score $\geq 0.75$, while all non-MIM models score lower.
  • ...and 10 more figures