
Symmetric masking strategy enhances the performance of Masked Image Modeling

Khanh-Binh Nguyen, Chae Jung Park

TL;DR

Masked Image Modeling (MIM) methods typically rely on random masking and extensive masking-ratio searches, which are computationally expensive and dataset-sensitive. This paper introduces SymMIM, a symmetric masking strategy that applies checkerboard-like masking along both axes to align masked patches with semantically similar visible regions, reducing hyperparameter tuning. The training objective combines a patch-level reconstruction loss $L_{mim}$ with a momentum-contrastive loss $L_{con}$ under an asymmetric MoCo-v3–style setup with an EMA encoder, including $\theta_k \leftarrow m \theta_k + (1 - m) \theta_q$ and a temperature parameter $\tau$ in the InfoNCE loss. Empirically, SymMIM achieves state-of-the-art results on ImageNet-1K (e.g., ViT-Large reaches $85.9\%$ top-1) and shows strong gains across downstream tasks such as object detection, instance segmentation, and semantic segmentation, while reducing the need for masking-ratio probing and complex tokenizers.
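The EMA update and the InfoNCE loss mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `ema_update` and `info_nce`, the default values of `m` and `tau`, and the use of in-batch negatives are assumptions made for illustration.

```python
import numpy as np

def ema_update(theta_k, theta_q, m=0.99):
    """Momentum update theta_k <- m * theta_k + (1 - m) * theta_q.

    theta_k and theta_q are lists of parameter arrays with matching shapes.
    The default m is an illustrative value, not one taken from the paper.
    """
    return [m * k + (1.0 - m) * q for k, q in zip(theta_k, theta_q)]

def info_nce(q, k, tau=0.2):
    """InfoNCE loss for a batch of query/key embeddings of shape (N, D).

    Row i of q and row i of k form the positive pair; all other rows of k
    act as negatives (an assumption of this sketch). tau is the temperature.
    """
    # L2-normalise so the dot product is a cosine similarity.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    logits = q @ k.T / tau
    # Subtract the row maximum for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal entries as the correct classes.
    return float(-np.mean(np.diag(log_prob)))
```

A smaller `tau` sharpens the softmax and penalises hard negatives more strongly, while `m` close to 1 makes the EMA encoder change slowly relative to the online encoder.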

Abstract

Masked Image Modeling (MIM) is a self-supervised learning technique that acquires detailed visual representations from unlabeled images by estimating the missing pixels in randomly masked regions. It has proven to be a powerful tool for pre-training Vision Transformers (ViTs), yielding impressive results across various tasks. Nevertheless, most MIM methods depend heavily on a random masking strategy to formulate the pretext task. This strategy requires numerous trials to find the optimal masking ratio, which is resource-intensive because the model must be pre-trained for 800 to 1600 epochs. Furthermore, this approach may not be suitable for all datasets. In this work, we propose a new masking strategy that effectively helps the model capture global and local features. Based on this masking strategy, we introduce SymMIM, our proposed training pipeline for MIM. SymMIM achieves a new SOTA accuracy of 85.9% on ImageNet using ViT-Large and surpasses the previous SOTA across downstream tasks such as image classification, semantic segmentation, object detection, and instance segmentation.

Paper Structure (17 sections, 5 equations, 3 figures, 6 tables)

Figures (3)

  • Figure 1: The pipeline of SymMIM. SymMIM combines symmetric masked patch prediction with the EMA update policy from MoCo-v3 [chenempirical] to effectively guide the representation learning of the trained model and enhance the network’s capability to capture more fine-grained visual context. Following the MoCo-v3 design, a projection head and a prediction head are attached to the online encoder, while the EMA encoder has only an EMA projection head.
  • Figure 2: Masking ratio comparison. Traditional MIM methods such as SimMIM [xie2022simmim] and MAE [he2022masked] need to perform masking-ratio probing to find the optimal ratio, which makes the procedure very expensive.
  • Figure 3: Recovered images using four different mask types: random masking, very small size symmetric masking, small size symmetric masking, and large size symmetric masking.
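The symmetric mask types compared in Figure 3 can be approximated by a checkerboard over the patch grid whose cell size varies. The sketch below is an assumption-laden illustration, not the paper's exact pattern: the function name `checkerboard_mask`, the `block` parameter, and the 14×14 grid (a 224-pixel image with 16-pixel patches) are all hypothetical choices.

```python
import numpy as np

def checkerboard_mask(grid_h, grid_w, block=1):
    """Return a boolean (grid_h, grid_w) array where True marks a masked patch.

    `block` is the side length, in patches, of each checkerboard cell, so a
    larger value gives a coarser pattern (hypothetical parameterisation).
    """
    rows = np.arange(grid_h)[:, None] // block   # cell index along the vertical axis
    cols = np.arange(grid_w)[None, :] // block   # cell index along the horizontal axis
    # Cells whose row and column indices sum to an odd number are masked.
    return (rows + cols) % 2 == 1

# Example: fine and coarse masks on an assumed 14x14 patch grid.
fine = checkerboard_mask(14, 14, block=1)
coarse = checkerboard_mask(14, 14, block=2)
```

Because the pattern alternates along both axes, every masked cell is adjacent to visible cells, and the masked fraction stays near one half regardless of `block`, so no separate masking-ratio search is needed in this sketch.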