i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable?

Kevin Zhang, Zhiqiang Shen

TL;DR

The paper investigates why Masked Image Modeling (MIM) pretraining yields strong downstream performance by examining the properties of its latent representations. It introduces i-MAE, which extends MAE with a mixture-based two-branch reconstruction, latent feature distillation, and a semantics-enhanced sampling strategy, enabling analysis of linear separability and semantic content in the latent space. Through quantitative measures (linear regression-based separability metrics) and semantics-focused ablations across CIFAR-10/100, Tiny-ImageNet, and ImageNet-1K, i-MAE demonstrates improved linear separability and richer semantics, which translate into better finetuning and linear probing performance. The work provides both improved representations and a framework for understanding and improving MAE-based self-supervised pretraining, with potential impact on how latent features are analyzed and enhanced in MIM methods.

Abstract

Masked image modeling (MIM) has been recognized as a strong self-supervised pre-training approach in the vision domain. However, the mechanisms and properties of the representations learned by such a scheme, as well as how to further enhance them, are so far not well explored. In this paper, we explore an interactive Masked Autoencoder (i-MAE) framework that enhances representation capability in two ways: (1) employing a two-way image reconstruction and a latent feature reconstruction with a distillation loss to learn better features; (2) proposing a semantics-enhanced sampling strategy to boost the learned semantics in MAE. With the proposed i-MAE architecture, we can address two critical questions about the behavior of the learned representations in MAE: (1) Is the separability of latent representations in Masked Autoencoders helpful for model performance? We study this by forcing the input to be a mixture of two images instead of one. (2) Can we enhance the representations in the latent feature space by controlling the degree of semantics during sampling in Masked Autoencoders? To this end, we propose a sampling strategy within a mini-batch based on the semantics of the training samples. Extensive experiments are conducted on CIFAR-10/100, Tiny-ImageNet, and ImageNet-1K to verify our observations. Furthermore, in addition to qualitatively analyzing the characteristics of the latent representations, we examine the existence of linear separability and the degree of semantics in the latent space by proposing two evaluation schemes. The surprising and consistent results demonstrate that i-MAE is a superior framework design for understanding MAE frameworks, as well as for achieving better representational ability. Code is available at https://github.com/vision-learning-acceleration-lab/i-mae.
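
The mixed input mentioned in the abstract can be illustrated with a minimal sketch. It assumes the mixture is a convex combination in which the coefficient `alpha` weights the subordinate image and `1 - alpha` weights the dominant one. The function name `mix_images`, the array shapes, and the random data are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def mix_images(img_subordinate, img_dominant, alpha):
    """Linearly mix two images of identical shape.

    Assumed convention: alpha weights the subordinate image and
    (1 - alpha) weights the dominant image.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if img_subordinate.shape != img_dominant.shape:
        raise ValueError("both images must have the same shape")
    return alpha * img_subordinate + (1.0 - alpha) * img_dominant


# Hypothetical CIFAR-sized inputs with pixel values in [0, 1).
rng = np.random.default_rng(0)
i_s = rng.random((32, 32, 3))  # subordinate image
i_d = rng.random((32, 32, 3))  # dominant image
mixed = mix_images(i_s, i_d, alpha=0.3)
```

Under this convention, a small `alpha` means the subordinate image contributes only faintly to the encoder input, which is why recovering it from the mixture is a meaningful test of how separable the latent features are.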

Paper Structure

This paper contains 17 sections, 7 equations, 10 figures, 5 tables, and 1 algorithm.

Figures (10)

  • Figure 1: Reconstruction results from the subordinate branch of i-MAE on ImageNet-1K validation images with different mixing coefficients $\boldsymbol{\alpha}$ (listed on the left). i-MAE is pre-trained with a reconstruction loss on both images of the linearly mixed input. Visually, i-MAE predictions reflect features of $\mathbf{I}_s$ and reconstruct the subordinate image well even at low mixing coefficients. At 0.45, which is more challenging, the reconstructions pick up elements such as colors from the dominant image, but the content still matches the target. More visualizations are provided in the Appendix.
  • Figure 2: Framework overview of our i-MAE. ① is the main branch that consists of a mixture encoder, a disentanglement module, and a two-way image reconstruction module. ② is the encoder part of a pre-trained vanilla MAE for distillation purposes (i.e., latent reconstruction). ③ is the decoder part in MAE and is discarded in training.
  • Figure 3: Qualitative ablation comparisons on CIFAR-10. Row (a): baseline vanilla MAE; (b): MAE with unmixed input; (c): our i-MAE without distillation; and (d): i-MAE with distillation.
  • Figure 4: Comparisons between mask ratios on Tiny-ImageNet validation set. i-MAE produces enhanced visual reconstructions from lower masking ratios when reconstructing images.
  • Figure 5: Left: comparison of the weight distributions of MAE and i-MAE pre-trained on ImageNet-1K; the i-MAE weights are sparser. Right: comparison of attention maps; i-MAE attends to a wider region of the input image, which encodes more information.
  • ...and 5 more figures
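
The training signal described in the Figure 2 caption (two-way image reconstruction plus latent reconstruction against a pre-trained MAE encoder) could be combined along the following lines. This is only a sketch under stated assumptions: mean squared error is used for every term, the loss is averaged over all pixels rather than over masked patches only, both latents are distilled, and the names `i_mae_style_loss` and `distill_weight` are hypothetical. The paper's actual objective may differ.

```python
import numpy as np


def mse(pred, target):
    """Mean squared error over all elements."""
    return float(np.mean((pred - target) ** 2))


def i_mae_style_loss(recon_s, recon_d, img_s, img_d,
                     latent_s, latent_d, teacher_s, teacher_d,
                     distill_weight=1.0):
    """Two-way reconstruction loss plus a latent distillation term.

    Assumptions (not confirmed by the source): MSE for every term,
    distillation applied to both disentangled latents, and a single
    scalar weight on the distillation term.
    """
    # Reconstruction of both the subordinate and the dominant image.
    reconstruction = mse(recon_s, img_s) + mse(recon_d, img_d)
    # Match disentangled latents to those of a frozen teacher encoder.
    distillation = mse(latent_s, teacher_s) + mse(latent_d, teacher_d)
    return reconstruction + distill_weight * distillation


# Toy usage: only the subordinate reconstruction and latent are off target.
ones, zeros = np.ones((2, 2)), np.zeros((2, 2))
loss = i_mae_style_loss(zeros, ones, ones, ones,
                        zeros, ones, ones, ones,
                        distill_weight=0.5)
```

Keeping the distillation term separate from the reconstruction term makes it easy to ablate, in the spirit of the with/without distillation comparison listed for Figure 3.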