SemanticMIM: Marring Masked Image Modeling with Semantics Compression for General Visual Representation

Yike Yuan, Huanzhang Dou, Fengjun Guo, Xi Li

TL;DR

SemanticMIM combines the complementary strengths of masked image modeling and contrastive learning through a proxy-based two-phase framework that first compresses image information into dedicated [PROXY] tokens and then reconstructs masked regions conditioned on these tokens. By disentangling compression from reconstruction and using the proxy to bridge the two phases, the approach achieves both semantic consistency and spatial completeness, which leads to more linearly separable features and improved performance on classification and segmentation tasks. The method yields gains when applied to existing MIM baselines (e.g., BEiT, MaskFeat) and provides interpretable attention visualizations that show region-level semantic focus guided by [PROXY] tokens. The work highlights the importance of semantic compression and explicit positional priors in self-supervised pre-training of transformer-based encoders, with practical benefits for dense prediction and downstream transfer.

Abstract

This paper presents a neat yet effective framework, named SemanticMIM, to integrate the advantages of masked image modeling (MIM) and contrastive learning (CL) for general visual representation. We conduct a thorough comparative analysis between CL and MIM, revealing that their complementary advantages fundamentally stem from two distinct phases, i.e., compression and reconstruction. Specifically, SemanticMIM leverages a proxy architecture that customizes the interaction between image and mask tokens, bridging these two phases to achieve a general visual representation with abundant semantic and positional awareness. Through extensive qualitative and quantitative evaluations, we demonstrate that SemanticMIM effectively amalgamates the benefits of CL and MIM, leading to significant improvements in performance and feature linear separability. SemanticMIM also offers notable interpretability through attention response visualization. Code is available at https://github.com/yyk-wew/SemanticMIM.
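
The abstract describes the proxy as something that "customizes interaction between image and mask tokens". One way to picture this is as a pair of attention visibility masks, one per phase. The sketch below is only an illustration under stated assumptions: the token order, the function name `build_phase_masks`, and the exact visibility rules are hypothetical and are not taken from the paper or its code release.

```python
# Hypothetical illustration of proxy-mediated attention; names and rules are
# assumptions, not the SemanticMIM implementation.

def build_phase_masks(num_proxy, num_visible, num_masked):
    """Return (compression, reconstruction) boolean attention masks.

    Assumed token order: [PROXY tokens | visible image tokens | mask tokens].
    mask[q][k] is True when query token q may attend to key token k.
    """
    n = num_proxy + num_visible + num_masked
    proxy = list(range(0, num_proxy))
    visible = list(range(num_proxy, num_proxy + num_visible))
    masked = list(range(num_proxy + num_visible, n))

    compression = [[False] * n for _ in range(n)]
    reconstruction = [[False] * n for _ in range(n)]

    # Compression (assumed rule): proxy and visible image tokens attend to
    # each other, so image content can be summarized into the proxies.
    # Mask tokens take no part in this phase.
    for q in proxy + visible:
        for k in proxy + visible:
            compression[q][k] = True

    # Reconstruction (assumed rule): each mask token reads only from the
    # proxies and from itself, never directly from image tokens.
    # Rows of non-mask tokens stay all False because they are not used
    # as queries in this phase.
    for q in masked:
        for k in proxy:
            reconstruction[q][k] = True
        reconstruction[q][q] = True

    return compression, reconstruction


# Example layout: 2 proxies (0-1), 3 visible patches (2-4), 2 mask tokens (5-6).
comp, recon = build_phase_masks(num_proxy=2, num_visible=3, num_masked=2)
mask_sees_image = any(recon[q][k] for q in (5, 6) for k in (2, 3, 4))
print("mask tokens attend directly to image tokens:", mask_sees_image)
```

Under these assumed rules, any image information that reaches a mask token has to pass through the proxies, which is the bridging role the abstract attributes to them.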

Paper Structure

This paper contains 19 sections, 4 equations, 15 figures, 6 tables.

Figures (15)

  • Figure 1: The comparison of attention response between masked image modeling (e.g., BEiT), contrastive learning (e.g., MoCov3), and the proposed SemanticMIM, which could effectively perceive arbitrary semantics with specific queries. The queries are marked with boxes of distinct colors.
  • Figure 2: A unified view of the masked image modeling (i.e., BEiT) and contrastive learning (i.e., MoCov3) paradigm.
  • Figure 3: Information propagation of the contrastive learning, masked image modeling, and our proposed SemanticMIM.
  • Figure 4: Comparison of the architecture between masked image modeling and our proposed SemanticMIM. MIM only focuses on Reconstruction. SemanticMIM contains two cascaded phases bridged by a proxy, i.e., Compression (left) and Reconstruction (right). The modules of the Reconstruction (right) phase are discarded after pre-training.
  • Figure 5: Number of [CLS] tokens. The y-axes show ImageNet-1K validation accuracy (%) under the fine-tuning protocol.
  • ...and 10 more figures