An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training

Jin Gao; Shubo Lin; Shaoru Wang; Yutong Kou; Zeming Li; Liang Li; Congxuan Zhang; Xiaoqin Zhang; Yizheng Wang; Weiming Hu

TL;DR

The paper investigates whether extremely simple lightweight Vision Transformers (ViTs) can benefit from masked image modeling (MIM) pre-training, and finds that MIM pre-training learns high-level semantics poorly in the higher layers. Following an observation-analysis-solution flow, the authors develop distillation-based MAE pre-training (including a decoupled variant, D2-MAE) that transfers high-level knowledge from a larger teacher to lightweight students while preserving useful locality biases. The approach reaches 79.4% top-1 accuracy on ImageNet-1K with ViT-Tiny and 78.9% with Hiera-Tiny, and sets state-of-the-art results for ADE20K segmentation (42.8% mIoU) and LaSOT tracking (66.1% AUC) in the lightweight regime. Distillation during MAE pre-training also improves transfer to data-scarce downstream tasks and carries over to hierarchical architectures such as Hiera-Tiny.

Abstract

Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) has enabled promising downstream performance on top of the learned self-supervised ViT features. In this paper, we ask whether the fine-tuning performance of \textit{extremely simple} lightweight ViTs can also benefit from this pre-training paradigm, a question that has received far less attention than the well-established methodology of lightweight architecture design. We use an observation-analysis-solution flow for our study. We first systematically observe how the evaluated pre-training methods behave differently with respect to the scale of the downstream fine-tuning data. We then analyze the layer representation similarities and attention maps across the obtained models, which clearly show that MIM pre-training learns the higher layers poorly, leading to unsatisfactory transfer performance on data-insufficient downstream tasks. This finding naturally guides the design of our distillation strategies during pre-training, which address this deterioration. Extensive experiments demonstrate the effectiveness of our approach. Our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical design ($5.7M$/$6.5M$ parameters) achieves $79.4\%$/$78.9\%$ top-1 accuracy on ImageNet-1K. It also enables SOTA performance on the ADE20K segmentation task ($42.8\%$ mIoU) and the LaSOT tracking task ($66.1\%$ AUC) in the lightweight regime. The latter even surpasses all current SOTA lightweight CPU-realtime trackers.
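
The layer representation similarity analysis mentioned above can be illustrated with a small sketch. The abstract does not name the similarity metric, so the example below assumes linear centered kernel alignment (CKA), a common choice for comparing layer features; the function names `linear_cka` and `similarity_heatmap` and the input layout (one `(n_samples, dim)` array per layer) are hypothetical and not taken from the paper.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two feature matrices of shape (n_samples, dim).

    Assumption: this metric is chosen for illustration only; the source
    text does not state which similarity measure the authors use.
    """
    # Center each feature dimension over the samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 normalized by ||X^T X||_F * ||Y^T Y||_F.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


def similarity_heatmap(feats_a, feats_b):
    """Pairwise CKA between the layers of two models.

    feats_a, feats_b: lists of (n_samples, dim) arrays, one per layer,
    computed on the same samples (hypothetical input layout).
    Returns an array of shape (len(feats_a), len(feats_b)).
    """
    return np.array([[linear_cka(a, b) for b in feats_b] for a in feats_a])
```

Under this assumption, a value near 1 means two layers encode nearly the same information up to rotation and isotropic scaling, so a row of low values in the higher layers of a pre-trained model against a supervised reference would be one way to see the higher-layer gap the abstract describes.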

Paper Structure (19 sections, 5 equations, 10 figures, 7 tables)

This paper contains 19 sections, 5 equations, 10 figures, 7 tables.

Introduction
Related Work
  Self-Supervised Learning
  Vision Transformers
  Knowledge Distillation
Observation
  Preliminaries and Experimental Study Design
  Fine-Tuning Evaluation on ImageNet Classification
  Transfer Learning Evaluation on Downstream Tasks
Analysis
  Linear Probing Evaluation
  Layer-Wise Representation Analysis
  Attention Map Analysis
Solution
  Improve MAE Pre-Training Based on Distillation
...and 4 more sections

Figures (10)

Figure 1: Our SSL pre-training with distillation on pure lightweight ViT-Tiny (5.7M)/Hiera-Tiny (6.5M) can achieve $79.4\%$/$78.9\%$ top-1 accuracy on ImageNet-1K validation set, which bridges the performance gap between extremely simple ViT architectures and delicately designed ones in the lightweight regime. The latency is measured on Orin with batch size 1. The transfer evaluation on other downstream tasks is also impressive.
Figure 2: Transformer block, where the feature map after the first LN of block $k$ can be used as normalized output representation for block $k-1$.
Figure 3: Layer representation similarity as heatmaps for the investigated ViT-Tiny models from different pre-training methods, with x and y axes indexing the layers (the 0 index indicates the patch embedding layer), and higher values indicate higher similarity. The fully-supervised baseline based on our recipe (see \ref{tab:imagenetcompare}) on IN1K is used as the reference.
Figure 4: Lower layers of the pre-trained ViT-Tiny models contribute to most gains on the data-sufficient IN1K. The contributions from higher layers of the pre-trained models increase as the downstream dataset scale shrinks, which indicates that higher layers matter in data-insufficient downstream tasks.
Figure 5: Attention distance and entropy analyses. We visualize the layer-by-layer distributions of the average attention distance and entropy across all the different attention heads using the box-whisker plots. We specifically plot the MAE pre-trained and MoCo-v3 pre-trained ViT-Tiny models along with the randomly initialized one in the same figures (see left) for a more intuitive and compact comparison.
...and 5 more figures
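
The per-head attention distance and entropy described in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the attention tensor is assumed to have shape `(heads, tokens, tokens)` with rows summing to 1, the class token is assumed to be already removed, patches are assumed to lie on a square grid, and the function names and the `patch=16` default are hypothetical.

```python
import numpy as np


def attention_entropy(attn, eps=1e-12):
    """Mean entropy of the attention rows, one value per head.

    attn: (heads, tokens, tokens), each row a probability distribution.
    Low entropy means concentrated attention, high entropy means broad.
    """
    ent = -(attn * np.log(attn + eps)).sum(axis=-1)  # (heads, tokens)
    return ent.mean(axis=-1)


def attention_distance(attn, grid, patch=16):
    """Attention-weighted mean pixel distance, one value per head.

    grid: patches per side, so tokens == grid * grid (assumed layout).
    patch: patch size in pixels (hypothetical default).
    """
    ys, xs = np.divmod(np.arange(grid * grid), grid)
    coords = np.stack([ys, xs], axis=1) * patch  # (tokens, 2)
    # Pairwise Euclidean distance between patch positions.
    dist = np.linalg.norm(coords[:, None] - coords[None], axis=-1)
    return (attn * dist).sum(axis=-1).mean(axis=-1)
```

Collecting these two values for every head in every layer would give the per-layer distributions that a box-whisker plot like the one in Figure 5 summarizes.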
