PiLaMIM: Toward Richer Visual Representations by Integrating Pixel and Latent Masked Image Modeling

Junmyeong Lee, Eui Jun Hwang, Sukmin Cho, Jong C. Park

TL;DR

PiLaMIM combines the complementary strengths of Pixel MIM and Latent MIM in a single framework with a shared context encoder, two decoders (one for pixel values, one for latent representations), and a CLS token that captures global context. It optimizes a combined objective $L = L_{pixel} + L_{latent} + L_{cls}$ to reconstruct both pixel values and latent representations. Empirically, PiLaMIM outperforms MAE, I-JEPA, and BootMAE in most cases on both high-level semantic tasks (e.g., CIFAR, iNaturalist, Places) and low-level tasks (Clevr/Count, Clevr/Dist), indicating richer visual representations. The results highlight the value of combining pixel- and latent-level cues and support the CLS token’s role in enhancing global semantic information.
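
As a rough illustration of the combined objective $L = L_{pixel} + L_{latent} + L_{cls}$, the sketch below sums three terms with equal weight. The summary does not state the per-term loss, so mean squared error is used here purely as an assumption, as is the idea that the CLS prediction is compared against a target CLS vector. All function and argument names (`mse`, `masked_mse`, `pilamim_loss`, `cls_target`, and so on) are hypothetical and not taken from the paper.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def masked_mse(preds: Sequence[Vector], targets: Sequence[Vector],
               masked_idx: List[int]) -> float:
    """Average per-patch MSE over the masked patch indices only."""
    return sum(mse(preds[i], targets[i]) for i in masked_idx) / len(masked_idx)


def pilamim_loss(x_hat, x, t_hat, t, cls_hat, cls_target, masked_idx) -> float:
    """Equal-weight sum of the three terms (MSE per term is an assumption)."""
    l_pixel = masked_mse(x_hat, x, masked_idx)    # pixel decoder vs. raw patches
    l_latent = masked_mse(t_hat, t, masked_idx)   # latent decoder vs. target latents
    l_cls = mse(cls_hat, cls_target)              # CLS term (assumed target vector)
    return l_pixel + l_latent + l_cls
```

Only the masked patches contribute to the pixel and latent terms, while the CLS term is computed once per image, which matches the split described in the Figure 1 caption.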

Abstract

In Masked Image Modeling (MIM), two primary methods exist: Pixel MIM and Latent MIM, each utilizing different reconstruction targets, raw pixels and latent representations, respectively. Pixel MIM tends to capture low-level visual details such as color and texture, while Latent MIM focuses on high-level semantics of an object. However, these distinct strengths of each method can lead to suboptimal performance in tasks that rely on a particular level of visual features. To address this limitation, we propose PiLaMIM, a unified framework that combines Pixel MIM and Latent MIM to integrate their complementary strengths. Our method uses a single encoder along with two distinct decoders: one for predicting pixel values and another for latent representations, ensuring the capture of both high-level and low-level visual features. We further integrate the CLS token into the reconstruction process to aggregate global context, enabling the model to capture more semantic information. Extensive experiments demonstrate that PiLaMIM outperforms key baselines such as MAE, I-JEPA and BootMAE in most cases, proving its effectiveness in extracting richer visual representations.

Paper Structure

This paper contains 19 sections, 5 equations, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: An overview of the PiLaMIM framework. The context encoder $f_{\textnormal{context}}$ processes the visible patches $X_{\mathcal{V}}$ to produce encoded tokens $Z_{\mathcal{V}}$. These tokens are then fed into two decoders, the pixel decoder $g_{\textnormal{pixel}}$ and the latent decoder $g_{\textnormal{latent}}$, which produce predicted pixel values $\hat{X}$ and predicted latent representations $\hat{T}$, respectively. The target encoder $f_{\textnormal{target}}$ processes the patched image $X$ to provide the target latent representations $T$. For training, we define three loss functions: $L_{\textnormal{pixel}}$ and $L_{\textnormal{latent}}$ for masked patches, and $L_{\textnormal{cls}}$ for the $\texttt{[CLS]}$ token.
  • Figure 2: The t-SNE visualizations of CIFAR100 "Fish" superclass images. Different colors represent different subclasses. Best viewed zoomed in.
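
The Figure 1 caption distinguishes visible patches $X_{\mathcal{V}}$, which go to the context encoder, from masked patches, on which the pixel and latent losses are computed. A minimal sketch of that split follows. It assumes uniform random masking, which this summary does not confirm; the helper name `split_patches`, the patch count of 196, and the mask ratio of 0.75 are illustrative values, not settings reported by the paper.

```python
import random
from typing import List, Tuple


def split_patches(num_patches: int, mask_ratio: float,
                  seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly split patch indices into (visible, masked) index lists.

    Uniform random masking is an assumption made for illustration only.
    """
    if not 0.0 < mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in (0, 1)")
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])    # reconstruction targets
    visible = sorted(order[num_masked:])   # input to the context encoder
    return visible, masked


# Hypothetical setting: a 14 x 14 patch grid with 75% of patches masked.
visible, masked = split_patches(num_patches=196, mask_ratio=0.75)
```

The two index lists are disjoint and together cover every patch, so the context encoder never sees a patch that the decoders are asked to reconstruct.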