Learning Mask Invariant Mutual Information for Masked Image Modeling

Tao Huang, Yanxiang Ma, Shan You, Chang Xu

TL;DR

This work reframes masked image modeling through the information bottleneck (IB) lens, arguing that MAEs succeed by preserving maximal relevant information while discarding irrelevant details in latent representations. It introduces MI-MAE, which optimizes two mutual-information-based losses: (i) maximizing the mutual information between latent features across orthogonal masks via InfoNCE, and (ii) minimizing the mutual information between latent features and inputs using a CLUB-based bound with an approximation network. The approach yields clear performance gains on ImageNet classification (e.g., up to 84.1% top-1 after 400 epochs) and improves downstream tasks such as COCO object detection and ADE20K semantic segmentation, validating the IB framework for MAEs. The paper also provides theoretical analyses of information distortion and latent-variable bounds, offering a principled pathway to design more powerful self-supervised vision models.
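To make the first loss concrete, the following is a minimal NumPy sketch of an InfoNCE objective between latent features of the same images under two different masks. It is an illustration under stated assumptions, not the authors' implementation: the function name `info_nce`, the cosine-similarity critic, and the `temperature` value are hypothetical choices, and the latents are assumed to be already pooled into one vector per image.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE loss between two batches of latent vectors of shape (N, D).

    Row i of z_a and row i of z_b are assumed to come from the same image
    under two different masks (the positive pair); all other rows in the
    batch act as negatives. Cosine similarity is an assumed critic.
    """
    # L2-normalise so the dot product is a cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    logits = z_a @ z_b.T / temperature            # (N, N) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability

    # Row-wise log-softmax; the diagonal holds the positive pairs.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

Minimizing this loss maximizes a lower bound on the mutual information between the two sets of latents, since $I(z_a; z_b) \geq \log N - \mathcal{L}_\mathrm{InfoNCE}$ for a batch of size $N$.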

Abstract

Masked autoencoders (MAEs) represent a prominent self-supervised learning paradigm in computer vision. Despite their empirical success, the underlying mechanisms of MAEs remain insufficiently understood. Recent studies have attempted to elucidate the functioning of MAEs through contrastive learning and feature representation analysis, yet these approaches often provide only implicit insights. In this paper, we propose a new perspective for understanding MAEs by leveraging the information bottleneck principle in information theory. Our theoretical analyses reveal that optimizing the latent features to balance relevant and irrelevant information is key to improving MAE performance. Building upon our proofs, we introduce MI-MAE, a novel method that optimizes MAEs through mutual information maximization and minimization. By enhancing latent features to retain maximal relevant information between them and the output, and minimizing irrelevant information between them and the input, our approach achieves better performance. Extensive experiments on standard benchmarks show that MI-MAE significantly outperforms MAE models in tasks such as image classification, object detection, and semantic segmentation. Our findings validate the theoretical framework and highlight the practical advantages of applying the information bottleneck principle to MAEs, offering deeper insights for developing more powerful self-supervised learning models.
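The minimization of information between latent features and the input can be illustrated with a sample-based CLUB upper bound, $I(X;Z) \leq \mathbb{E}_{p(x,z)}[\log q(z \mid x)] - \mathbb{E}_{p(x)}\mathbb{E}_{p(z)}[\log q(z \mid x)]$. The sketch below assumes a diagonal-Gaussian variational distribution $q(z \mid x)$ whose mean and log-variance would come from an approximation network; the function name `club_upper_bound` and the Gaussian form are assumptions for illustration and may differ from the paper's setup.

```python
import numpy as np

def club_upper_bound(mu, logvar, z):
    """Sample-based CLUB estimate of an upper bound on I(X; Z).

    mu, logvar: (N, D) mean and log-variance of an assumed Gaussian
                q(z | x_i), e.g. produced by an approximation network.
    z:          (N, D) latent features, where z[i] is paired with x_i.
    """
    var = np.exp(logvar)

    # log q(z_i | x_i) for matched pairs. The log-variance and 2*pi terms
    # depend only on x_i, so they cancel in the difference below.
    positive = -0.5 * ((z - mu) ** 2 / var).sum(axis=1)            # (N,)

    # log q(z_j | x_i) averaged over all j (samples from the marginal of z).
    diff = z[None, :, :] - mu[:, None, :]                          # (N, N, D)
    negative = -0.5 * (diff ** 2 / var[:, None, :]).sum(axis=2).mean(axis=1)

    return float((positive - negative).mean())
```

When `z` is strongly determined by `x`, the matched log-likelihood is much higher than the mismatched one and the estimate is large; when `z` is independent of `x`, the two terms roughly cancel and the estimate is near zero.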

Paper Structure

This paper contains 22 sections, 34 equations, 3 figures, 5 tables, 1 algorithm.

Figures (3)

  • Figure 1: Pipeline of MI-MAE for each mask $m_k$. We introduce two losses, $l_{i,k}^{(\mathrm{max\_mi})}$ and $l_{i}^{(\mathrm{min\_mi})}$, on the latent features to maximize the relevant information and minimize the irrelevant information, respectively; $\mathcal{L}_\mathrm{rec}$ denotes the original MAE loss. The top sequence in the figure denotes forward propagation and the bottom sequence denotes backpropagation. $m$ denotes the mask and $X$ denotes the original map. $\gamma$ is the inverse function of a decoder, $\zeta$ is the output of the reduced target map of the MAE on $\gamma$, $Z$ is defined as a latent feature in a small neighbourhood of $\zeta$, and their bias $\varepsilon_z$ is determined by $\epsilon_y$. In backpropagation, $\nabla$ represents gradients, while $\nabla_h$ is the gradient in layer $h$ of the encoder.
  • Figure 2: Ablation of masking ratios.
  • Figure 3: The curve of reconstruction loss $\mathcal{L}_\mathrm{rec}$ during the $400$-epoch training of MI-MAE. We set $\epsilon_l$ to 0.5.