AMD: Automatic Multi-step Distillation of Large-scale Vision Models
Cheng Han, Qifan Wang, Sohail A. Dianat, Majid Rabbani, Raghuveer M. Rao, Yi Fang, Qiang Guan, Lifu Huang, Dongfang Liu
TL;DR
This paper tackles the challenge of compressing large transformer-based vision models for resource-constrained deployment by proposing Automatic Multi-step Distillation (AMD). AMD automatically discovers an optimal teacher-assistant (TA) through a three-stage cascade: Structural Pruning to generate TA candidates, Joint Optimization to evaluate them with shared parameters, and Optimal Selection using the Negative Performance-Scale Derivative (NPSD) to pick the best TA. The method jointly distills from the full teacher to the selected TA and then to the student, using a composite loss that includes cross-entropy, a distillation logit term, and a feature-mimicking term. Across CIFAR-10, CIFAR-100, and ImageNet, AMD consistently surpasses single-step and multi-step baselines for ViT and Swin architectures, while also delivering significantly faster training, validating its effectiveness for large-scale vision model compression and deployment on resource-limited devices.
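As an illustration of the composite loss mentioned above, the following minimal sketch combines a cross-entropy term, a temperature-softened logit distillation term, and a feature-mimicking term for a single example. It is not the authors' implementation: the KL-divergence form of the logit term, the mean-squared-error form of the feature term, the weights `alpha` and `beta`, the `temperature` value, and all function names are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy of the predicted distribution against a hard label."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def feature_mse(student_feat, teacher_feat):
    """Mean squared error between two equally sized feature vectors."""
    return sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / len(student_feat)


def composite_loss(student_logits, teacher_logits, student_feat, teacher_feat,
                   label, alpha=0.5, beta=0.1, temperature=2.0):
    """Cross-entropy + weighted logit distillation + weighted feature mimicking.

    alpha, beta, and temperature are hypothetical hyperparameters.
    """
    ce = cross_entropy(student_logits, label)
    # Softened teacher and student distributions; the T^2 factor is the
    # conventional rescaling used with temperature-based distillation.
    kd = temperature ** 2 * kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    feat = feature_mse(student_feat, teacher_feat)
    return ce + alpha * kd + beta * feat


if __name__ == "__main__":
    loss = composite_loss(
        student_logits=[1.0, 0.2, -0.5],
        teacher_logits=[2.0, 0.5, -1.0],
        student_feat=[0.1, 0.4],
        teacher_feat=[0.2, 0.3],
        label=0,
    )
    print(f"composite loss: {loss:.4f}")
```

In a multi-step setting, the same function could be applied once with the teacher supervising the TA and once with the TA supervising the student. When the student matches its supervisor exactly, the two distillation terms vanish and only the cross-entropy term remains.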
Abstract
Transformer-based architectures have become the de facto standard models for diverse vision tasks owing to their superior performance. As the size of the models continues to scale up, model distillation becomes extremely important in various real applications, particularly on devices limited by computational resources. However, prevailing knowledge distillation methods exhibit diminished efficacy when confronted with a large capacity gap between the teacher and the student, e.g., a 10x compression rate. In this paper, we present a novel approach named Automatic Multi-step Distillation (AMD) for large-scale vision model compression. In particular, our distillation process unfolds across multiple steps. Initially, the teacher undergoes distillation to form an intermediate teacher-assistant model, which is subsequently distilled further to the student. An efficient and effective optimization framework is introduced to automatically identify the optimal teacher-assistant that leads to the maximal student performance. We conduct extensive experiments on multiple image classification datasets, including CIFAR-10, CIFAR-100, and ImageNet. The findings consistently reveal that our approach outperforms several established baselines, paving a path for future knowledge distillation methods on large-scale vision models.
