Revealing the Dark Secrets of Masked Image Modeling

Zhenda Xie, Zigang Geng, Jingcheng Hu, Zheng Zhang, Han Hu, Yue Cao

TL;DR

This paper probes masked image modeling (MIM) as a general pre-training paradigm by contrasting it with supervised pre-training through attention-map visualizations and CKA similarity analyses, plus large-scale fine-tuning across semantic, geometric, and combined tasks. It uncovers that MIM injects a locality inductive bias across layers while preserving diverse attention heads, leading to strong performance on pose estimation, depth estimation, and tracking, and competitive results on semantic tasks. The findings explain why MIM benefits Vision Transformers with large receptive fields and suggest MIM as a robust, general-purpose pre-training approach. The work highlights MIM's potential to complement or surpass supervised pre-training in diverse downstream settings and motivates further exploration of MIM across architectures.

Abstract

Masked image modeling (MIM) as pre-training has been shown to be effective for numerous downstream vision tasks, but how and where MIM works remains unclear. In this paper, we compare MIM with the long-dominant supervised pre-trained models from two perspectives, visualizations and experiments, to uncover their key representational differences. From the visualizations, we find that MIM brings a locality inductive bias to all layers of the trained models, whereas supervised models tend to focus locally at lower layers and more globally at higher layers. This may be why MIM helps Vision Transformers, which have a very large receptive field, to optimize. With MIM, the model maintains a large diversity across attention heads in all layers. For supervised models, by contrast, the diversity across attention heads almost disappears in the last three layers, and less diversity harms fine-tuning performance. From the experiments, we find that MIM models perform significantly better than their supervised counterparts on geometric and motion tasks with weak semantics or fine-grained classification tasks. Without bells and whistles, a standard MIM pre-trained SwinV2-L achieves state-of-the-art performance on pose estimation (78.9 AP on COCO test-dev and 78.0 AP on CrowdPose), depth estimation (0.287 RMSE on NYUv2 and 1.966 RMSE on KITTI), and video object tracking (70.7 SUC on LaSOT). On semantic understanding datasets whose categories are sufficiently covered by the supervised pre-training, MIM models still achieve highly competitive transfer performance. With a deeper understanding of MIM, we hope that our work can inspire new and solid research in this direction.

Paper Structure (31 sections, 1 equation, 14 figures, 7 tables)

Figures (14)

  • Figure 1: The averaged attention distance in different attention heads (dots) w.r.t the layer number on supervised model (a), contrastive learning model (b), and SimMIM model (c) with ViT-B as the backbone architecture.
  • Figure 2: (a) The error rate of fine-tuning on ImageNet-1K (blue circle $\circ$) and averaged attention distance (red diamond $\diamond$) w.r.t AvgDist (averaged distance of masked pixels to the nearest visible pixels) with Swin-B as the backbone. Points ($\diamond$ or $\circ$) denote the SimMIM models with different masking ratios and masked patch sizes. (b) The averaged attention distance in different attention heads (dots) w.r.t the layer number on supervised model (b1) and SimMIM model (b2) with Swin-B as the backbone.
  • Figure 3: The entropy of each head's attention distribution w.r.t the layer number on (a) supervised model, (b) contrastive learning model, and (c) SimMIM model with ViT-B as the backbone.
  • Figure 4: The KL divergence between attention distributions of different heads (small dots) and the averaged KL divergence (large dots) in each layer w.r.t the layer number on (a) supervised model, (b) contrastive learning model, and (c) SimMIM model with ViT-B as the backbone architecture.
  • Figure 5: The performance of the COCO $val2017$ pose estimation (left) and NYUv2 depth estimation (right) when we drop several last layers of the SwinV2-B backbone. When the model becomes smaller, the performance of the supervised pre-trained model increases on the pose estimation and keeps the same on the depth estimation. The last layers in the supervised pre-trained model lose diversity across different attention heads and are harmful to the downstream tasks.
  • ...and 9 more figures
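
The captions of Figures 1, 3, and 4 refer to three per-head attention statistics: averaged attention distance, entropy of the attention distribution, and KL divergence between heads. The following is a minimal NumPy sketch of how such statistics could be computed, not the authors' implementation. It assumes the attention tensor has shape (heads, N, N) over N = grid x grid patch tokens only (no class token), that each row sums to 1, and that distances are measured between patch positions scaled by a hypothetical `patch_size`. All function names are illustrative.

```python
import numpy as np


def patch_coords(grid):
    """Return a (grid*grid, 2) array of (row, col) patch positions."""
    rows, cols = np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij")
    return np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)


def averaged_attention_distance(attn, grid, patch_size=16):
    """Attention-weighted mean distance (in pixels) per head.

    attn: (heads, N, N) array; each row is a distribution over N keys.
    patch_size is an assumed value, not taken from the paper.
    """
    coords = patch_coords(grid) * patch_size
    # Pairwise Euclidean distance between query and key patch positions.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Expected distance per query, then averaged over queries.
    return (attn * dist[None]).sum(axis=-1).mean(axis=-1)


def attention_entropy(attn, eps=1e-12):
    """Shannon entropy of each head's attention rows, averaged over queries."""
    return -(attn * np.log(attn + eps)).sum(axis=-1).mean(axis=-1)


def pairwise_head_kl(attn, eps=1e-12):
    """KL(head_i || head_j), averaged over queries; shape (heads, heads)."""
    log_p = np.log(attn + eps)
    # Broadcast to (heads, heads, N, N): p_i * (log p_i - log p_j).
    diff = log_p[:, None] - log_p[None, :]
    return (attn[:, None] * diff).sum(axis=-1).mean(axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grid, heads = 4, 3
    n = grid * grid
    logits = rng.normal(size=(heads, n, n))
    # Row-wise softmax turns random logits into attention distributions.
    attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    print("distance per head:", averaged_attention_distance(attn, grid))
    print("entropy per head: ", attention_entropy(attn))
    print("mean off-diagonal KL:", pairwise_head_kl(attn).sum() / (heads * (heads - 1)))
```

Under these assumptions, a small averaged attention distance indicates a head that attends locally, low entropy indicates a concentrated attention pattern, and a small KL divergence between heads indicates that the heads in a layer behave alike, which is the loss of diversity the abstract describes for the last layers of supervised models.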