SemanticMIM: Marrying Masked Image Modeling with Semantics Compression for General Visual Representation
Yike Yuan, Huanzhang Dou, Fengjun Guo, Xi Li
TL;DR
SemanticMIM combines the complementary strengths of masked image modeling and contrastive learning through a proxy-based two-phase framework that first compresses image information into dedicated [PROXY] tokens and then reconstructs masked regions conditioned on these tokens. By disentangling compression from reconstruction and using the proxy to bridge the two phases, the approach achieves both semantic consistency and spatial completeness, leading to more linearly separable features and improved performance on classification and segmentation tasks. The method yields gains when applied to existing MIM baselines (e.g., BEiT, MaskFeat) and provides interpretable attention visualizations showing region-level semantic focus guided by [PROXY] tokens. The work advances general visual representation learning with transformer-based encoders by highlighting the importance of semantic compression and explicit positional priors in self-supervised pre-training, with practical benefits for dense prediction and downstream transfer.
Abstract
This paper presents a neat yet effective framework, named SemanticMIM, to integrate the advantages of masked image modeling (MIM) and contrastive learning (CL) for general visual representation. We conduct a thorough comparative analysis between CL and MIM, revealing that their complementary advantages fundamentally stem from two distinct phases, i.e., compression and reconstruction. Specifically, SemanticMIM leverages a proxy architecture that customizes the interaction between image and mask tokens, bridging these two phases to achieve a general visual representation with rich semantic and positional awareness. Through extensive qualitative and quantitative evaluations, we demonstrate that SemanticMIM effectively amalgamates the benefits of CL and MIM, leading to significant improvements in performance and feature linear separability. SemanticMIM also offers notable interpretability through attention response visualization. Code is available at https://github.com/yyk-wew/SemanticMIM.
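
One way to picture a proxy that mediates between image and mask tokens is as an attention-visibility pattern. The sketch below is illustrative only and is not taken from the released code: the function name `build_proxy_attention_mask`, the token ordering, and the specific routing rules are all assumptions chosen to show how mask tokens could be restricted to reading proxy tokens rather than image tokens directly.

```python
def build_proxy_attention_mask(num_image, num_proxy, num_mask):
    """Return allowed[q][k]: True if query token q may attend to key token k.

    Token order (an assumption for this sketch): image, then proxy, then mask.
    """
    kinds = ["image"] * num_image + ["proxy"] * num_proxy + ["mask"] * num_mask
    # Hypothetical routing rules, not confirmed by the paper:
    allowed_keys = {
        "image": {"image"},           # visible patches attend to each other
        "proxy": {"image", "proxy"},  # compression: proxies read the image
        "mask": {"proxy", "mask"},    # reconstruction: masks read proxies only
    }
    return [[k in allowed_keys[q] for k in kinds] for q in kinds]


# Print the pattern for a toy sequence: "x" = may attend, "." = blocked.
mask = build_proxy_attention_mask(num_image=3, num_proxy=2, num_mask=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Under these assumed rules, information from visible patches can reach a masked position only by passing through the [PROXY] tokens, which is the bottleneck the two-phase description suggests.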
