Knowledge Distillation with Adapted Weight

Sirong Wu; Xi Luo; Junjie Liu; Yuhui Deng

TL;DR

The paper addresses the challenge of deploying large models by introducing KD-AIF, a knowledge distillation framework that assigns adaptive weights to training samples using influence functions to improve generalization under distribution shift and noisy labels. By modeling the data-point influence on validation performance and transforming it into instance-level weights through a sigmoid-based mechanism, KD-AIF enables simultaneous updates of both the teacher and the student and extends to semi-supervised learning. The approach achieves superior performance across image classification and NLP benchmarks, demonstrates robustness to label noise, and provides a pathway toward more transparent and robust distillation under the SAFE principles. While the method incurs Hessian-based computational overhead, the authors outline practical strategies and future directions for scalable approximations and applications to large-scale models.
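
As a rough illustration of what "modeling the data-point influence on validation performance" can look like, the sketch below computes the classical influence-function score from robust statistics, $\phi_i = -\nabla L_{val}(\hat{\theta})^\top H^{-1} \nabla L(z_i, \hat{\theta})$, for a two-parameter least-squares line. The toy data, the function names, and the choice of model are assumptions made for illustration; the paper's own estimator and models may differ. The explicit Hessian inverse is also where the Hessian-based overhead comes from.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ~ a*x + b (hypothetical toy model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def grad(theta, x, y):
    """Gradient of the per-sample loss 0.5 * (a*x + b - y)**2 w.r.t. (a, b)."""
    a, b = theta
    r = a * x + b - y
    return r * x, r


def influence_scores(theta, train, val):
    """phi_i = -g_val^T H^{-1} g_i for every training sample.

    Assumed sign convention: phi_i > 0 means up-weighting sample i would
    increase the validation loss (harmful); phi_i < 0 means it would help.
    """
    xs, ys = train
    n = len(xs)
    # Hessian of the mean training loss for the 2-parameter line.
    h11, h12, h22 = sum(x * x for x in xs) / n, sum(xs) / n, 1.0
    det = h11 * h22 - h12 * h12
    # Mean gradient of the validation loss.
    vgrads = [grad(theta, x, y) for x, y in zip(*val)]
    g1 = sum(g[0] for g in vgrads) / len(vgrads)
    g2 = sum(g[1] for g in vgrads) / len(vgrads)
    # s = H^{-1} g_val, using the closed-form 2x2 inverse.
    s1 = (h22 * g1 - h12 * g2) / det
    s2 = (-h12 * g1 + h11 * g2) / det
    return [-(s1 * gx + s2 * gb)
            for gx, gb in (grad(theta, x, y) for x, y in zip(xs, ys))]


# Toy data: y = 2x + 1, with the last training label corrupted (7 -> 20).
train = ([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 20.0])
val = ([0.5, 1.5, 2.5], [2.0, 4.0, 6.0])
theta = fit_line(*train)
phi = influence_scores(theta, train, val)
```

In this toy setting the corrupted sample receives the largest positive score, which is the behaviour one would hope for from an influence-based noise signal. For deep networks the Hessian cannot be inverted explicitly, so approximations are needed in practice.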

Abstract

Although large models have shown a strong capacity to solve large-scale problems in many areas, including natural language and computer vision, their voluminous parameters are hard to deploy in real-time systems due to computational and energy constraints. To address this, knowledge distillation through a Teacher-Student architecture offers a sustainable pathway to compress the knowledge of large models into more manageable sizes without significantly compromising performance. To enhance the robustness and interpretability of this framework, it is critical to understand how individual training data points affect model performance, an area that remains underexplored. We propose the Knowledge Distillation with Adaptive Influence Weight (KD-AIF) framework, which leverages influence functions from robust statistics to assign weights to training data and is grounded in the four key SAFE principles: Sustainability, Accuracy, Fairness, and Explainability. This novel approach not only optimizes distillation but also increases transparency by revealing the significance of different data. The exploration of various update mechanisms within the KD-AIF framework further elucidates its potential to significantly improve learning efficiency and generalization in student models, marking a step toward more explainable and deployable large models. KD-AIF is effective in knowledge distillation and also performs strongly in semi-supervised learning, outperforming existing baselines and methods on multiple benchmarks (CIFAR-100, CIFAR-10-4k, SVHN-1k, and GLUE).
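
One way to turn influence scores into the per-sample weights the abstract describes is a scaled sigmoid. The sketch below is an assumption, not the paper's exact weight function: it uses $w_i = 2\,\sigma(-\alpha \phi_i)$, so that a sample with zero influence keeps weight 1, helpful samples (negative score under the assumed sign convention) are up-weighted, and harmful ones are down-weighted. The names `influence_weights`, `weighted_loss`, and the temperature `alpha` are hypothetical.

```python
import math


def influence_weights(influences, alpha=1.0):
    """Map influence scores to instance weights, roughly in (0, 2).

    Assumed convention: a positive score means up-weighting the sample would
    raise the validation loss (harmful); a negative score means it would help.
    """
    weights = []
    for phi in influences:
        # Clamp the exponent so math.exp cannot overflow on extreme scores.
        z = max(-50.0, min(50.0, alpha * phi))
        # 2 * sigmoid(-z): equals 1 at z = 0, above 1 if helpful, below 1 if harmful.
        weights.append(2.0 / (1.0 + math.exp(z)))
    return weights


def weighted_loss(per_sample_losses, weights):
    """Mean of per-sample losses, each scaled by its instance weight."""
    total = sum(w * loss for w, loss in zip(weights, per_sample_losses))
    return total / len(per_sample_losses)


scores = [-2.0, 0.0, 3.0]          # helpful, neutral, harmful (toy values)
w = influence_weights(scores)
loss = weighted_loss([0.7, 0.7, 0.7], w)
```

Centering the weights at 1 means an all-neutral batch reduces to the ordinary unweighted loss, which keeps the weighted objective comparable to equal-weight distillation.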

Paper Structure (24 sections, 2 theorems, 33 equations, 5 figures, 6 tables, 1 algorithm)

Introduction
Background and Problem Statement
Assumptions and Base Models
Assumption
Vanilla distillation
Online distillation
Problem Statement
Methods
Influence Function
Influence Weight
Analysis of weight function
KD-AIF
KD-AIF in Semi-Supervised Learning
Experiments
Datasets and Hyperparameter settings
...and 9 more sections

Key Result

Theorem 1

The performance of the perturbed model $\hat{\theta}_{S,\epsilon}$ on the validation distribution $Q^\prime$ will be better than that of the original model $\hat{\theta}_{S}$, if the following two conditions are satisfied: 1. $\epsilon_i = 0$ when $\phi_i(\hat{\theta}) = 0$; 2. The perturbation $\ep

Figures (5)

Figure 1: Example training data from the MNIST, CIFAR-10, CIFAR-100, and ImageNet datasets. Left: a helpful training sample with the correct label; Middle: a harmful training sample with the correct label; Right: a harmful training sample with the wrong label. The presence of harmful training data can cause the model to misclassify test data.
Figure 2: Comparison of vanilla distillation, online distillation, and our proposed framework KD-AIF. The dotted red lines show the feedback from the student model. Vanilla distillation employs a two-stage training pipeline that first fine-tunes the teacher model on the target task. Both online distillation and KD-AIF adopt a one-stage joint training strategy, allowing simultaneous training of the teacher and student models. Unlike online distillation, which weights all samples equally, KD-AIF introduces an adaptive feedback mechanism based on influence scores, which evaluates the importance of training data and dynamically adjusts the training process to improve generalization and robustness.
Figure 3: The workflow of KD-AIF framework in Semi-Supervised Learning. The teacher model is initially trained on labelled data (Step 1) and then used to generate pseudo-labels for unlabelled data (Step 2). The labelled and pseudo-labelled data are then combined to train
Figure 4: Different update mechanisms. $T$ and $S$ denote the teacher model and the student model, respectively. $w$ denotes the influence weight calculated on the student model. The subscript number represents the distillation iteration. In Mechanism 1, the influence weight is used solely to update the student model, while in Mechanism 2, it is used exclusively to update the teacher model. Mechanism 3 proposes using the influence weight for updating both the teacher and student models. Building on Mechanism 3, Mechanism 4 also takes into account the influence weight calculated by the previous student model. Each mechanism will repeat the distillation process $k$ times.
Figure 5: Distribution of influence weights on 10% noisy training data in the CIFAR-100 and SST-2 datasets. If the influence weight is less than 1, the corresponding data is considered harmful. Our proposed method KD-AIF demonstrates the ability to accurately identify a majority of noisy data across different classification tasks.
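
The reading of Figure 5, where a weight below 1 marks a sample as harmful, can be checked against known noisy labels with a few lines of code. The sketch below is illustrative only: the helper names and the toy weights are hypothetical, and the paper's own evaluation protocol may differ.

```python
def flag_harmful(weights, threshold=1.0):
    """Mark a sample as harmful when its influence weight is below the threshold."""
    return [w < threshold for w in weights]


def noisy_recall(flags, is_noisy):
    """Fraction of truly noisy samples that were flagged as harmful."""
    noisy_total = sum(1 for n in is_noisy if n)
    if noisy_total == 0:
        return 0.0
    caught = sum(1 for f, n in zip(flags, is_noisy) if f and n)
    return caught / noisy_total


def flag_precision(flags, is_noisy):
    """Fraction of flagged samples that are truly noisy."""
    flagged_total = sum(1 for f in flags if f)
    if flagged_total == 0:
        return 0.0
    caught = sum(1 for f, n in zip(flags, is_noisy) if f and n)
    return caught / flagged_total


# Toy values: five samples, two of which carry corrupted labels.
weights = [1.4, 0.3, 1.1, 0.8, 0.6]
is_noisy = [False, True, False, True, False]
flags = flag_harmful(weights)
recall = noisy_recall(flags, is_noisy)
precision = flag_precision(flags, is_noisy)
```

Reporting precision alongside recall matters here because, as Figure 1 notes, a correctly labelled sample can also be harmful, so not every flagged sample is a labelling error.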

Theorems & Definitions (6)

Theorem 1
Definition 1
Definition 2
Theorem 2
proof
proof
