Asymmetric Masked Distillation for Pre-Training Small Foundation Models

Zhiyu Zhao, Bingkun Huang, Sen Xing, Gangshan Wu, Yu Qiao, Limin Wang

TL;DR

This work tackles the computational burden of pre-training large masked autoencoders by introducing Asymmetric Masked Distillation (AMD) to train compact vision transformers. AMD gives the teacher a lower masking ratio so that it sees richer context while the student remains highly masked, and it enforces serial multi-layer feature alignment between teacher and student to regularize learning. The approach performs strongly on both image and video MAE tasks, achieving 84.6% top-1 accuracy on IN1K with ImageMAE and 73.3% on Something-Something V2 with VideoMAE, and it also delivers transfer gains on downstream benchmarks. Overall, AMD offers a path to small foundation models with improved efficiency and transferability, and the code is available to replicate the results.

Abstract

Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding. Scale is a primary factor influencing the performance of these foundation models. However, large foundation models often incur high computational cost. This paper focuses on pre-training relatively small vision transformer models that can be efficiently adapted to downstream tasks. Specifically, taking inspiration from knowledge distillation in model compression, we propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding. The core of AMD is an asymmetric masking strategy, in which the teacher model sees more context information through a lower masking ratio, while the student model keeps a high masking ratio. We design customized multi-layer feature alignment between the teacher encoder and student encoder to regularize the pre-training of the student MAE. To demonstrate the effectiveness and versatility of AMD, we apply it to both ImageMAE and VideoMAE for pre-training relatively small ViT models. AMD achieves 84.6% classification accuracy on IN1K using the ViT-B model. On the Something-Something V2 dataset, AMD achieves 73.3% classification accuracy using the ViT-B model, a 3.7% improvement over the original ViT-B model from VideoMAE. We also transfer AMD pre-trained models to downstream tasks and obtain consistent performance improvements over the original masked autoencoding. The code and models are available at https://github.com/MCG-NJU/AMD.
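
The asymmetric masking and feature alignment described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the masking ratios (0.75 for the teacher, 0.9 for the student), the choice to make the student's visible tokens a subset of the teacher's, the linear projection, and the mean-squared-error loss are all assumptions made for illustration. Features are also indexed over all token positions for simplicity, whereas an MAE encoder processes only visible tokens.

```python
import numpy as np


def asymmetric_masks(num_tokens, teacher_ratio, student_ratio, rng):
    """Return boolean visibility masks (True = visible) for teacher and student.

    Assumption: both masks come from one random permutation, so every token
    the student sees is also seen by the teacher.
    """
    if not 0.0 <= teacher_ratio < student_ratio < 1.0:
        raise ValueError("expected 0 <= teacher_ratio < student_ratio < 1")
    order = rng.permutation(num_tokens)
    n_teacher = int(round(num_tokens * (1.0 - teacher_ratio)))
    n_student = int(round(num_tokens * (1.0 - student_ratio)))
    teacher_visible = np.zeros(num_tokens, dtype=bool)
    student_visible = np.zeros(num_tokens, dtype=bool)
    teacher_visible[order[:n_teacher]] = True   # lower masking ratio: more context
    student_visible[order[:n_student]] = True   # higher masking ratio: less context
    return teacher_visible, student_visible


def alignment_loss(student_feats, teacher_feats, student_visible, projection):
    """Hypothetical alignment term for one layer pair.

    student_feats: [N, D_s], teacher_feats: [N, D_t], projection: [D_s, D_t].
    Computes the MSE between projected student features and teacher features
    on the tokens visible to the student (and therefore to the teacher).
    """
    projected = student_feats[student_visible] @ projection
    target = teacher_feats[student_visible]
    return float(np.mean((projected - target) ** 2))


rng = np.random.default_rng(0)
teacher_visible, student_visible = asymmetric_masks(196, 0.75, 0.9, rng)
print(teacher_visible.sum(), student_visible.sum())
```

With 196 tokens (a 14 x 14 patch grid), these assumed ratios leave the teacher 49 visible tokens and the student 20. A multi-layer variant would sum `alignment_loss` over several teacher and student layer pairs; which layers to pair is not specified in the abstract.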

Paper Structure

This paper contains 24 sections, 9 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Comparison of the symmetric and asymmetric masking strategies. The asymmetric masking strategy allows the teacher to acquire more contextual information than the student.
  • Figure 2: Pipeline of Asymmetric Masked Distillation (AMD). We present an asymmetric masking strategy to transfer the knowledge of pre-trained teacher models to the student's masked pre-training. The strategy gives the teacher a lower masking ratio so that it can extract richer visual information. This information serves as guidance to regularize the student's masked pre-training and yields a more powerful pre-trained model that can benefit a variety of downstream tasks.
  • Figure 3: We apply four feature alignment methods in our work, with the serial alignment being our default setting.
  • Figure 4: The average attention distance in different attention heads at each layer depth. Distances are calculated over 16 frames, and frame spacing is calculated over the maximum distance of each frame. Results are averaged over the SSV2 test set.
  • Figure 5: Per-category breakdown of the accuracy comparison between AMD and DMAE, showing the performance gap for each category on the SSV2 test set.
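
The attention distance plotted in Figure 4 can be sketched as follows. The caption does not give the exact formula, so this follows the common definition of mean attention distance: for each query token, the distances to all key tokens are weighted by the attention weights and summed, and the result is averaged over queries. The function name and input layout are assumptions, and the 16-frame normalization mentioned in the caption is not reproduced here.

```python
import numpy as np


def mean_attention_distance(attn, coords):
    """Average attention distance per head.

    attn:   [heads, N, N] attention weights; each row sums to 1.
    coords: [N, d] token positions (for example time, row, column).
    Returns an array of shape [heads].
    """
    # Pairwise Euclidean distances between token positions, shape [N, N].
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Expected distance for each query under its attention distribution.
    per_query = (attn * dist[None, :, :]).sum(axis=-1)  # [heads, N]
    return per_query.mean(axis=-1)


# Two tokens one unit apart: a head that attends only to itself has distance 0,
# and a head that attends uniformly has distance 0.5.
coords = np.array([[0.0], [1.0]])
attn = np.stack([np.eye(2), np.full((2, 2), 0.5)])
print(mean_attention_distance(attn, coords))
```

A larger value indicates that a head aggregates information from tokens farther away in space or time, while a smaller value indicates more local attention.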