Less or More From Teacher: Exploiting Trilateral Geometry For Knowledge Distillation

Chengming Hu, Haolun Wu, Xuan Li, Chen Ma, Xi Chen, Jun Yan, Boyu Wang, Xue Liu

TL;DR

This study introduces a novel adaptive method for learning a sample-wise knowledge fusion ratio, exploiting both the correctness of teacher and student, as well as how well the student mimics the teacher on each sample.

Abstract

Knowledge distillation aims to train a compact student network using soft supervision from a larger teacher network and hard supervision from ground truths. However, determining an optimal knowledge fusion ratio that balances these supervisory signals remains challenging. Prior methods generally resort to a constant or heuristic-based fusion ratio, which often falls short of a proper balance. In this study, we introduce a novel adaptive method for learning a sample-wise knowledge fusion ratio, exploiting both the correctness of teacher and student, as well as how well the student mimics the teacher on each sample. Our method naturally leads to the intra-sample trilateral geometric relations among the student prediction ($S$), teacher prediction ($T$), and ground truth ($G$). To counterbalance the impact of outliers, we further extend to the inter-sample relations, incorporating the teacher's global average prediction $\bar{T}$ for samples within the same class. A simple neural network then learns the implicit mapping from the intra- and inter-sample relations to an adaptive, sample-wise knowledge fusion ratio in a bilevel-optimization manner. Our approach provides a simple, practical, and adaptable solution for knowledge distillation that can be employed across various architectures and model sizes. Extensive experiments demonstrate consistent improvements over other loss re-weighting methods on image classification, attack detection, and click-through rate prediction.
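As a rough illustration of what a sample-wise knowledge fusion ratio means, the sketch below combines a soft (teacher) term and a hard (ground-truth) term for one sample. It is a minimal sketch under stated assumptions, not the authors' implementation: the convention that $\alpha$ weights the teacher term, the use of KL divergence for the soft term and cross-entropy for the hard term, and all function names are assumptions made for illustration. In the paper, $\alpha$ is produced per sample by a learned network; here it is simply passed in.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Hard supervision: negative log-probability of the ground-truth class."""
    return -math.log(probs[label])

def kl_divergence(p, q):
    """Soft supervision: KL(p || q) between teacher (p) and student (q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def fused_loss(student_logits, teacher_logits, label, alpha):
    """Per-sample loss; alpha in [0, 1] weights the teacher term (assumed convention)."""
    s = softmax(student_logits)
    t = softmax(teacher_logits)
    return alpha * kl_divergence(t, s) + (1.0 - alpha) * cross_entropy(s, label)

def batch_loss(student_batch, teacher_batch, labels, alphas):
    """Mean fused loss over a batch, with one alpha per sample."""
    losses = [
        fused_loss(s, t, y, a)
        for s, t, y, a in zip(student_batch, teacher_batch, labels, alphas)
    ]
    return sum(losses) / len(losses)
```

With a constant ratio, every entry of `alphas` would be identical; a sample-wise method instead supplies a different value for each sample, for example a smaller one where the teacher is wrong.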

Paper Structure (46 sections, 7 equations, 5 figures, 15 tables)

Figures (5)

  • Figure 1: Motivation experiment on CIFAR-100 with a ResNet-34 teacher and a ResNet-18 student. The student is trained with varying values of the knowledge fusion ratio ($\alpha$). Data are first partitioned into $\mathcal{D}$ (where the teacher predicts correctly) and $\mathcal{D'}$ (where the teacher predicts incorrectly); each subset is then split into five equal-sized groups based on the student-teacher prediction discrepancy ($\mathcal{ST}$). We claim that the appropriate $\alpha$ depends strongly on $\mathcal{ST}$ and on the correctness of the teacher's predictions.
  • Figure 2: A comparison between prior works and our proposed TGeo-KD. The first two blocks show, from a geometric view on a single sample, the relations each method captures when learning the knowledge fusion ratio. The third block cross-checks the attributes of the different methods. Details on how each point is represented and how the geometric relations are computed are given in Sec. \ref{sec:method}.
  • Figure 3: Knowledge fusion ratio distributions learned with (dark) and without (light) incorporating $\mathcal{ST}$ when learning $\alpha$. We first partition all samples into two subsets based on the teacher's correctness. In each subset, we sort the samples in descending order of their $\mathcal{ST}$ values and select the top and bottom 20% as those with large and small discrepancies, respectively.
  • Figure 4: Knowledge fusion ratio distributions on normal samples (light) and outliers (dark).
  • Figure 5: Prediction discrepancies on normal samples and outliers during training and testing.