Toward Student-Oriented Teacher Network Training For Knowledge Distillation

Chengyu Dong, Liyuan Liu, Jingbo Shang

TL;DR

The paper tackles the misalignment between teacher optimization and student performance in knowledge distillation by proposing SoTeacher, a teacher-training framework that uses ERM augmented with Lipschitz and consistency regularization to better approximate the true label distribution $p^*(x)$. The authors establish theoretical feasibility for learning $p^*(x)$ under a mixed-feature data model with a Lipschitz, transformation-robust feature extractor, and they implement SoTeacher to enforce the necessary regularities. Empirical results across CIFAR-100, Tiny-ImageNet, and ImageNet demonstrate consistent improvements in student accuracy across various KD algorithms and architectures, even when teacher accuracy drops, indicating improved knowledge transfer. The approach is practical, adds minimal overhead via regularization terms and temporal ensembling, and offers a principled direction to enhance distillation effectiveness with broader implications for transfer learning and ensemble methods.
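The consistency term mentioned above can be pictured with a small sketch. It assumes the temporal-ensembling form in which each training input's target is a bias-corrected exponential moving average of its own past predictions; the momentum `alpha`, the function names, and the toy probabilities are illustrative assumptions, not values from the paper.

```python
# Sketch of temporal-ensembling-style consistency regularization.
# Assumption: the target for an input is a bias-corrected exponential moving
# average (EMA) of its past predictions; `alpha` is a hypothetical momentum.

def update_ensemble(ensemble, probs, alpha=0.6):
    """Blend the running per-class average with the newest prediction."""
    return [alpha * e + (1.0 - alpha) * p for e, p in zip(ensemble, probs)]

def ensemble_target(ensemble, num_updates, alpha=0.6):
    """Correct the start-up bias of an EMA that was initialised at zero."""
    correction = 1.0 - alpha ** num_updates
    return [e / correction for e in ensemble]

def consistency_loss(probs, target):
    """Mean squared difference between the current prediction and the target."""
    return sum((p - t) ** 2 for p, t in zip(probs, target)) / len(probs)

# One training input, three classes, predictions from three successive epochs.
epoch_probs = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.7, 0.2, 0.1]]
ensemble = [0.0, 0.0, 0.0]
losses = []
for t, probs in enumerate(epoch_probs, start=1):
    if t > 1:  # no target exists before the first update
        target = ensemble_target(ensemble, t - 1)
        losses.append(consistency_loss(probs, target))
    ensemble = update_ensemble(ensemble, probs)
```

The penalty is small when the teacher's prediction for an input stays close to its own history, which is one way to discourage abrupt changes in the predicted distribution during training.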

Abstract

How to conduct teacher training for knowledge distillation is still an open problem. It has been widely observed that a best-performing teacher does not necessarily yield the best-performing student, suggesting a fundamental discrepancy between the current teacher training practice and the ideal teacher training strategy. To fill this gap, we explore the feasibility of training a teacher that is oriented toward student performance with empirical risk minimization (ERM). Our analyses are inspired by the recent findings that the effectiveness of knowledge distillation hinges on the teacher's capability to approximate the true label distribution of training inputs. We theoretically establish that the ERM minimizer can approximate the true label distribution of training data as long as the feature extractor of the learner network is Lipschitz continuous and is robust to feature transformations. In light of our theory, we propose a teacher training method SoTeacher which incorporates Lipschitz regularization and consistency regularization into ERM. Experiments on benchmark datasets using various knowledge distillation algorithms and teacher-student pairs confirm that SoTeacher can improve student accuracy consistently.
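Read as a training objective, the method described above plausibly takes the form of a weighted sum. The weights $\lambda_{\text{LR}}$ and $\lambda_{\text{CR}}$ follow the notation of the paper's hyperparameter study; the names of the individual terms are placeholders introduced here, and the exact form of each regularizer is defined in the paper itself.

```latex
\mathcal{L}_{\text{teacher}}
  = \mathcal{L}_{\text{ERM}}
  + \lambda_{\text{LR}} \, \mathcal{L}_{\text{LR}}
  + \lambda_{\text{CR}} \, \mathcal{L}_{\text{CR}}
```

Setting $\lambda_{\text{LR}} = \lambda_{\text{CR}} = 0$ in this sketch recovers plain ERM, that is, standard teacher training.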

Paper Structure

This paper contains 26 sections, 7 theorems, 53 equations, 2 figures, 7 tables.

Key Result

Lemma 3.3

Let $\bar{y}_z \coloneqq \frac{1}{N} \sum_{\{i| z\in\mathcal{Z}^{(i)}\}} 1(y^{(i)})$, where $\mathcal{Z}^{(i)}$ denotes the set of feature names in the $i$-th input, so that $\{i| z\in\mathcal{Z}^{(i)} \}$ is the set of inputs that contain feature $z$. Let $f^*$ be a minimizer of the empirical
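To make the definition of $\bar{y}_z$ concrete, here is a small sketch. It assumes that $N$ is the total number of training inputs and that $1(y)$ is the one-hot encoding of label $y$; the feature names, labels, and function names are made up for illustration.

```python
# Sketch of the label average for a feature z, following the definition above.
# Assumptions: N is the total number of training inputs and 1(y) is a one-hot
# vector over the classes. The toy data below is invented for illustration.

def one_hot(label, num_classes):
    """Return the one-hot encoding of an integer class label."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

def label_mean_for_feature(z, feature_sets, labels, num_classes):
    """Sum one-hot labels over the inputs that contain z, then divide by N."""
    n = len(labels)
    total = [0.0] * num_classes
    for features, label in zip(feature_sets, labels):
        if z in features:
            encoded = one_hot(label, num_classes)
            total = [t + h for t, h in zip(total, encoded)]
    return [t / n for t in total]

# Three inputs, two classes; each input is described by its set of feature names.
feature_sets = [{"a", "b"}, {"a"}, {"b", "c"}]
labels = [0, 1, 1]
y_bar_a = label_mean_for_feature("a", feature_sets, labels, num_classes=2)
y_bar_c = label_mean_for_feature("c", feature_sets, labels, num_classes=2)
```

Read literally, the normalization is by $N$ rather than by the number of inputs containing $z$, so the entries of the result need not sum to one.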

Figures (2)

  • Figure 1: We train teacher models on CIFAR-100, saving a checkpoint every 10 epochs, and then use this checkpoint to train a student model through knowledge distillation. With our method, the teacher is trained with a focus on improving student performance, leading to better student performance even if the teacher's own performance is not as high.
  • Figure 2: Effect of varying the hyperparameters in our teacher training method, including the weight for Lipschitz regularization $\lambda_{\text{LR}}$, the weight for consistency regularization $\lambda_{\text{CR}}$, and its scheduler. The settings of our method used to report the results (e.g., the Tiny-ImageNet results table) are denoted as "$\blacktriangle$". The standard teacher training practice is denoted as "$\blacksquare$" for comparison.
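The scheduler for $\lambda_{\text{CR}}$ mentioned in the Figure 2 caption can be illustrated with a minimal sketch. It assumes a linear ramp from zero to a maximum weight over training; the ramp shape, the function name `linear_ramp`, and the numbers used are hypothetical and may differ from the schedules the paper actually compares.

```python
# Hypothetical scheduler for the consistency weight: a linear ramp from 0 to
# max_weight across training. The shape and the values are assumptions.

def linear_ramp(epoch, total_epochs, max_weight):
    """Return the consistency weight to use at a given (0-indexed) epoch."""
    if total_epochs <= 1:
        return max_weight
    progress = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return max_weight * progress

# Weight used at each of five epochs when the maximum weight is 1.0.
weights = [linear_ramp(epoch, 5, 1.0) for epoch in range(5)]
```

A ramp of this kind keeps the consistency term weak early on, when the teacher's past predictions are still unreliable targets.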

Theorems & Definitions (18)

  • Definition 3.1: Mixed-feature distribution
  • Definition 3.2: Invariant feature extractor
  • Lemma 3.3: Convergence of the probabilistic predictions of features
  • Lemma 3.4: Convergence of the sample mean of labels
  • Lemma 3.5: Approximation of the true label distribution of each feature
  • Theorem 3.6: Approximation error under a hypothetical case
  • Definition 3.7: Lipschitz-continuous feature extractor
  • Definition 3.8: Transformation-robust feature extractor
  • Theorem 3.9: Approximation error under a realistic case
  • Definition A.1: Modified Softmax
  • ...and 8 more