Toward Student-Oriented Teacher Network Training For Knowledge Distillation

Chengyu Dong; Liyuan Liu; Jingbo Shang

TL;DR

The paper tackles the misalignment between teacher optimization and student performance in knowledge distillation by proposing SoTeacher, a teacher-training framework that uses ERM augmented with Lipschitz and consistency regularization to better approximate the true label distribution $p^*(x)$. The authors establish theoretical feasibility for learning $p^*(x)$ under a mixed-feature data model with a Lipschitz, transformation-robust feature extractor, and they implement SoTeacher to enforce the necessary regularities. Empirical results across CIFAR-100, Tiny-ImageNet, and ImageNet demonstrate consistent improvements in student accuracy across various KD algorithms and architectures, even when teacher accuracy drops, indicating improved knowledge transfer. The approach is practical, adds minimal overhead via regularization terms and temporal ensembling, and offers a principled direction to enhance distillation effectiveness with broader implications for transfer learning and ensemble methods.

Abstract

How to conduct teacher training for knowledge distillation is still an open problem. It has been widely observed that a best-performing teacher does not necessarily yield the best-performing student, suggesting a fundamental discrepancy between the current teacher training practice and the ideal teacher training strategy. To fill this gap, we explore the feasibility of training a teacher that is oriented toward student performance with empirical risk minimization (ERM). Our analyses are inspired by the recent findings that the effectiveness of knowledge distillation hinges on the teacher's capability to approximate the true label distribution of training inputs. We theoretically establish that the ERM minimizer can approximate the true label distribution of training data as long as the feature extractor of the learner network is Lipschitz continuous and is robust to feature transformations. In light of our theory, we propose a teacher training method SoTeacher which incorporates Lipschitz regularization and consistency regularization into ERM. Experiments on benchmark datasets using various knowledge distillation algorithms and teacher-student pairs confirm that SoTeacher can improve student accuracy consistently.
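As a rough illustration of how the two regularizers might enter the ERM objective, the sketch below computes a per-example teacher loss in plain Python. The function names, the squared-error consistency penalty against a temporally ensembled target, the sum-of-layer-norms Lipschitz proxy, and the momentum value are all assumptions made for this illustration; the abstract does not specify the exact form of either penalty.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    # Standard ERM term: negative log-likelihood of the hard label.
    return -math.log(probs[label])

def consistency_penalty(probs, target_probs):
    # Assumed form: squared L2 distance between the current prediction
    # and an ensemble target accumulated over earlier epochs.
    return sum((p - t) ** 2 for p, t in zip(probs, target_probs))

def lipschitz_penalty(layer_norms):
    # Assumed proxy: sum of per-layer operator norms, supplied by the caller.
    return sum(layer_norms)

def update_ensemble(target_probs, probs, momentum=0.6):
    # Temporal ensembling as an exponential moving average (momentum is hypothetical).
    return [momentum * t + (1.0 - momentum) * p for t, p in zip(target_probs, probs)]

def teacher_loss(logits, label, target_probs, layer_norms, lambda_lr, lambda_cr):
    # ERM loss plus weighted Lipschitz and consistency regularizers.
    probs = softmax(logits)
    return (cross_entropy(probs, label)
            + lambda_lr * lipschitz_penalty(layer_norms)
            + lambda_cr * consistency_penalty(probs, target_probs))
```

Setting `lambda_lr` and `lambda_cr` to zero recovers plain cross-entropy training, which corresponds to the standard teacher training practice the abstract contrasts against.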

Paper Structure (26 sections, 7 theorems, 53 equations, 2 figures, 7 tables)

Introduction
Preliminaries
Theoretical Feasibility to Learn True Label Distribution of Training Data
Notations and Problem Setup
A Hypothetical Case: Invariant Feature Extractor
Realistic Case
SoTeacher
Experiments
Experiment setup
Results
Related Work
Conclusion and Future Work
Proof
Modified Softmax
Lemma \ref{lemma:prediction-feature}
...and 11 more sections

Key Result

Lemma 3.3

Let $\bar{y}_z \coloneqq \frac{1}{N} \sum_{\{i \mid z\in\mathcal{Z}^{(i)}\}} 1(y^{(i)})$, where $\mathcal{Z}^{(i)}$ denotes the set of feature names in the $i$-th input, so that $\{i \mid z\in\mathcal{Z}^{(i)}\}$ is the set of inputs that contain feature $z$. Let $f^*$ be a minimizer of the empirical ...
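The definition of $\bar{y}_z$ can be made concrete with a small worked computation. The sketch below reads $1(y^{(i)})$ as the one-hot encoding of the label, which is an assumption, and takes the normalizer $N$ as an explicit argument `n` because this excerpt does not define it. The function names and the toy data are hypothetical.

```python
def one_hot(label, num_classes):
    # Assumed reading of 1(y): the one-hot vector of the label.
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

def label_mean_for_feature(z, feature_sets, labels, num_classes, n):
    # Sum the one-hot labels of every input whose feature set contains z,
    # then divide by the normalizer n (passed in; not defined in the excerpt).
    total = [0.0] * num_classes
    for features, label in zip(feature_sets, labels):
        if z in features:
            for k, v in enumerate(one_hot(label, num_classes)):
                total[k] += v
    return [t / n for t in total]

# Toy data: four inputs, each with a set of feature names and a hard label.
feature_sets = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "c"}]
labels = [0, 1, 1, 0]

# Three inputs contain feature "a"; here n is chosen as that count.
y_bar_a = label_mean_for_feature("a", feature_sets, labels, num_classes=2, n=3)
```

With `n` chosen as the number of inputs containing the feature, the result is the empirical label frequency among those inputs, which is the quantity the lemma title ties to the feature's probabilistic prediction.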

Figures (2)

Figure 1: We train teacher models on CIFAR-100, saving a checkpoint every 10 epochs, and then use each checkpoint to train a student model through knowledge distillation. Our method trains the teacher with a focus on student performance, so the student performs better even if the teacher's own performance is not as high.
Figure 2: Effect of varying the hyperparameters in our teacher training method, including the weight for Lipschitz regularization $\lambda_{\text{LR}}$, the weight for consistency regularization $\lambda_{\text{CR}}$, and its scheduler. The settings of our method used to report the results (e.g. Table \ref{table:experiment-tiny-imagenet}) are denoted as "$\blacktriangle$". The standard teacher training practice is denoted as "$\blacksquare$" for comparison.
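For readers unfamiliar with the distillation step that the Figure 1 caption refers to, the sketch below shows the classic temperature-scaled distillation loss for a single example. It is a generic formulation, not necessarily the loss used in the paper's experiments, which cover several knowledge distillation algorithms. The function names and the default `temperature` and `alpha` values are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled, numerically stable softmax.
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    # Soft targets from a (frozen) teacher checkpoint.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions.
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0.0)
    # Ordinary cross-entropy on the hard label at temperature 1.
    ce = -math.log(softmax(student_logits)[label])
    # The T^2 factor keeps soft-target gradients on a comparable scale.
    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kl
```

The KL term is where the quality of the teacher's probabilistic predictions matters: a teacher whose softened outputs are closer to the true label distribution gives the student a more informative target, independent of the teacher's top-1 accuracy.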

Theorems & Definitions (18)

Definition 3.1: Mixed-feature distribution
Definition 3.2: Invariant feature extractor
Lemma 3.3: Convergence of the probabilistic predictions of features
Lemma 3.4: Convergence of the sample mean of labels
Lemma 3.5: Approximation of the true label distribution of each feature
Theorem 3.6: Approximation error under a hypothetical case
Definition 3.7: Lipschitz-continuous feature extractor
Definition 3.8: Transformation-robust feature extractor
Theorem 3.9: Approximation error under a realistic case
Definition A.1: Modified Softmax
...and 8 more
