An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training
Jin Gao, Shubo Lin, Shaoru Wang, Yutong Kou, Zeming Li, Liang Li, Congxuan Zhang, Xiaoqin Zhang, Yizheng Wang, Weiming Hu
TL;DR
The paper investigates whether extremely simple lightweight Vision Transformers can benefit from masked image modeling (MIM) pre-training and finds that MIM learns high-level semantics poorly in the upper layers. Following an observation-analysis-solution flow, the authors develop distillation-based MAE pre-training (including a decoupled variant, D2-MAE) that transfers high-level knowledge from a larger teacher to lightweight students while preserving useful locality biases. The approach achieves strong results on ImageNet with ViT-Tiny (79.4% top-1) and Hiera-Tiny (78.9% top-1), and reaches state-of-the-art performance on ADE20K segmentation and LaSOT tracking in the lightweight regime. The authors further show that adding distillation to MAE pre-training improves transfer to data-scarce downstream tasks and that the approach extends to hierarchical architectures such as Hiera-Tiny, underscoring its generality and practical value for efficient vision models.
Abstract
Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) has enabled promising downstream performance on top of the learned self-supervised ViT features. In this paper, we ask whether the fine-tuning performance of \textit{extremely simple} lightweight ViTs can also benefit from this pre-training paradigm, a question that has received far less study than the well-established methodology of lightweight architecture design. We use an observation-analysis-solution flow for our study. We first systematically observe how the evaluated pre-training methods behave differently with respect to the downstream fine-tuning data scale. We then analyze the layer representation similarities and attention maps across the obtained models, which clearly show that MIM pre-training learns the higher layers poorly, leading to unsatisfactory transfer performance on data-insufficient downstream tasks. This finding naturally guides the design of our distillation strategies during pre-training, which address this deterioration. Extensive experiments demonstrate the effectiveness of our approach. Our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical design ($5.7M$/$6.5M$ parameters) achieves $79.4\%$/$78.9\%$ top-1 accuracy on ImageNet-1K. It also enables SOTA performance on the ADE20K segmentation task ($42.8\%$ mIoU) and the LaSOT tracking task ($66.1\%$ AUC) in the lightweight regime. The latter even surpasses all current SOTA lightweight CPU-realtime trackers.
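The layer representation similarity analysis mentioned in the abstract can be illustrated with a small sketch. The abstract does not name the similarity measure, so the code below assumes linear centered kernel alignment (CKA), a common choice for comparing layers across models; the function names `linear_cka` and `layer_similarity_matrix` and the `(n_samples, dim)` feature layout are hypothetical and not taken from the paper.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two feature matrices of shape (n_samples, dim).

    Assumption: linear CKA stands in for the similarity measure,
    which the abstract does not specify.
    """
    # Center each feature dimension over the samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by
    # the Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


def layer_similarity_matrix(feats_a, feats_b):
    """Pairwise CKA between per-layer features of two models.

    feats_a, feats_b: lists of arrays, one (n_samples, dim) array per
    layer, extracted from the same input samples (hypothetical layout).
    """
    return np.array([[linear_cka(a, b) for b in feats_b] for a in feats_a])
```

Under these assumptions, comparing a MIM pre-trained model against a reference model layer by layer yields a matrix in which weak similarity in the rows for the upper layers would be consistent with the inferior higher-layer learning described above.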
