Knowledge Distillation with Adapted Weight
Sirong Wu, Xi Luo, Junjie Liu, Yuhui Deng
TL;DR
The paper addresses the challenge of deploying large models by introducing KD-AIF, a knowledge distillation framework that assigns adaptive weights to training samples using influence functions to improve generalization under distribution shift and noisy labels. By modeling the data-point influence on validation performance and transforming it into instance-level weights through a sigmoid-based mechanism, KD-AIF enables simultaneous updates of both the teacher and the student and extends to semi-supervised learning. The approach achieves superior performance across image classification and NLP benchmarks, demonstrates robustness to label noise, and provides a pathway toward more transparent and robust distillation under the SAFE principles. While the method incurs Hessian-based computational overhead, the authors outline practical strategies and future directions for scalable approximations and applications to large-scale models.
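The sigmoid-based weighting mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation. The sign convention (a negative influence score means that upweighting the sample is estimated to lower validation loss), the `temperature` parameter, and the function names `influence_to_weights` and `weighted_mean_loss` are all assumptions introduced for illustration. The influence scores are taken as given inputs rather than computed from a Hessian.

```python
import math


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def influence_to_weights(influences, temperature=1.0):
    """Map per-sample influence scores to weights in (0, 1).

    Assumed convention (not taken from the paper): a score is the estimated
    change in validation loss when the sample is upweighted, so negative
    scores mark helpful samples and receive weights above 0.5.
    """
    return [sigmoid(-s / temperature) for s in influences]


def weighted_mean_loss(losses, weights):
    # Normalized weighted average of per-sample distillation losses.
    total = sum(weights)
    return sum(w * l for w, l in zip(weights, losses)) / total


# Hypothetical scores: one helpful, one neutral, one harmful sample.
influences = [-2.0, 0.0, 3.0]
losses = [0.5, 1.0, 4.0]

weights = influence_to_weights(influences)
print(weights)
print(weighted_mean_loss(losses, weights))
```

Under this convention, a sample with a harmful influence score (for example, one with a noisy label) contributes less to the averaged loss than it would under uniform weighting.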
Abstract
Although large models have shown a strong capacity to solve large-scale problems in many areas, including natural language processing and computer vision, their large parameter counts make them hard to deploy in real-time systems under computational and energy constraints. To address this, knowledge distillation through a Teacher-Student architecture offers a sustainable pathway to compress the knowledge of large models into more manageable sizes without significantly compromising performance. To enhance the robustness and interpretability of this framework, it is critical to understand how individual training samples affect model performance, an area that remains underexplored. We propose the \textbf{Knowledge Distillation with Adaptive Influence Weight (KD-AIF)} framework, which leverages influence functions from robust statistics to assign weights to training data and is grounded in the four SAFE principles: Sustainability, Accuracy, Fairness, and Explainability. This approach not only optimizes distillation but also increases transparency by revealing the significance of different data. Our exploration of various update mechanisms within the KD-AIF framework further shows its potential to improve learning efficiency and generalization in student models, marking a step toward more explainable and deployable large models. KD-AIF is effective in knowledge distillation and also shows exceptional performance in semi-supervised learning, outperforming existing baselines and methods on multiple benchmarks (CIFAR-100, CIFAR-10-4k, SVHN-1k, and GLUE).
