Distilling Balanced Knowledge from a Biased Teacher

Seonghak Kim

TL;DR

LTKD combines a rebalanced cross-group loss, which calibrates the teacher's group-level predictions, with a reweighted within-group loss, which ensures equal contribution from all groups. Together, these allow a student to distill balanced knowledge from a biased teacher.

Abstract

Conventional knowledge distillation, designed for model compression, fails on long-tailed distributions because the teacher model tends to be biased toward head classes and provides limited supervision for tail classes. We propose Long-Tailed Knowledge Distillation (LTKD), a novel framework that reformulates the conventional objective into two components: a cross-group loss, capturing mismatches in prediction distributions across class groups (head, medium, and tail), and a within-group loss, capturing discrepancies within each group's distribution. This decomposition reveals the specific sources of the teacher's bias. To mitigate the inherited bias, LTKD introduces (1) a rebalanced cross-group loss that calibrates the teacher's group-level predictions and (2) a reweighted within-group loss that ensures equal contribution from all groups. Extensive experiments on CIFAR-100-LT, TinyImageNet-LT, and ImageNet-LT demonstrate that LTKD significantly outperforms existing methods in both overall and tail-class accuracy, thereby showing its ability to distill balanced knowledge from a biased teacher for real-world applications.
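
The two-component reformulation described in the abstract can be illustrated with the chain rule for KL divergence. The notation here is an assumption made for illustration: the group index $G$, the group masses $g_G = \sum_{i \in G} p_i$, and the renormalized within-group probabilities $\tilde{p}_i = p_i / g_G$ do not appear in this summary, so this is a sketch of the general identity rather than the paper's exact formulation.

$$
\mathrm{KL}\left(\mathbf{p}^T \,\|\, \mathbf{p}^S\right)
= \underbrace{\sum_{G} g^T_G \log \frac{g^T_G}{g^S_G}}_{\text{cross-group}}
+ \underbrace{\sum_{G} g^T_G \sum_{i \in G} \tilde{p}^T_i \log \frac{\tilde{p}^T_i}{\tilde{p}^S_i}}_{\text{within-group}}
$$

Under this reading, each within-group term is weighted by the teacher's own group mass $g^T_G$, so a teacher that assigns little probability to tail classes also down-weights the tail group's contribution to the loss.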

Paper Structure

This paper contains 22 sections, 15 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Overview of standard KD on long-tailed distributions. The training data is highly imbalanced (pie chart and bar graph), with most samples belonging to head classes (orange) and few to tail classes (blue). This creates a biased teacher whose predictions ($\mathbf{p}^T$), visually represented by varying bar heights, are skewed toward head classes. Standard KD forces the student to mimic this bias ($\mathbf{p}^S$), resulting in poor generalization on tail classes.
  • Figure 2: Overview of the proposed Long-Tailed Knowledge Distillation (LTKD). Our method first decomposes the standard KL-based KD loss into a cross-group component (capturing mismatches in aggregated group-level predictions) and a within-group component (capturing internal discrepancies). To correct the teacher's class bias, LTKD then applies a rebalanced cross-group loss and reweighted within-group loss, ensuring a balanced knowledge transfer to the student.
  • Figure 3: Group-wise prediction trends of the teacher model on the CIFAR-100 dataset across training batches. On the balanced version (solid lines), the teacher produces nearly uniform group-wise outputs. In contrast, on the long-tailed version (dotted lines), the teacher assigns higher probabilities to head classes and lower probabilities to tail classes.
  • Figure 4: Distillation loss curves for KD, DKD, and LTKD for Head (left), Medium (middle), and Tail (right) class groups. LTKD achieves a consistently lower loss across all three groups compared to the baselines, overcoming the suboptimal convergence caused by teacher bias.
  • Figure 5: Hyperparameter sensitivity of LTKD on CIFAR-100-LT ($\gamma=20$), showing overall accuracy (left) and tail-class accuracy (right). The blue line shows the effect of varying the cross-group weight $\alpha$ (while $\beta=1$). The red line shows the effect of varying the within-group weight $\beta$ (while $\alpha=6$).
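
The cross-group and within-group split described in the Figure 2 caption can be checked numerically on a toy example. The following is a minimal sketch in plain Python, not the authors' implementation: the helper names, the six-class toy distributions, and the three-group split are all hypothetical. The `equal_weight_loss` function is only one possible reading of "equal contribution from all groups" (a uniform average over groups), and the teacher calibration used by the rebalanced cross-group loss is not specified in this summary, so it is not modeled here.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def group_mass(p, groups):
    """Aggregate class probabilities into one probability per group."""
    return [sum(p[i] for i in idx) for idx in groups]

def within_group(p, idx):
    """Renormalize the probabilities of one group so they sum to 1."""
    total = sum(p[i] for i in idx)
    return [p[i] / total for i in idx]

def decompose_kd(p_t, p_s, groups):
    """Split KL(p_t || p_s) into a cross-group term and per-group within-group terms."""
    g_t, g_s = group_mass(p_t, groups), group_mass(p_s, groups)
    cross = kl(g_t, g_s)
    within = [kl(within_group(p_t, idx), within_group(p_s, idx)) for idx in groups]
    return cross, within, g_t

def equal_weight_loss(p_t, p_s, groups, alpha=1.0, beta=1.0):
    """ASSUMPTION: average the within-group terms uniformly instead of
    weighting them by the teacher's group mass. alpha and beta mirror the
    cross-group and within-group weights named in Figure 5; the defaults
    here are placeholders, not the paper's settings."""
    cross, within, _ = decompose_kd(p_t, p_s, groups)
    return alpha * cross + beta * sum(within) / len(within)

# Hypothetical toy data: 6 classes split into head, medium, and tail groups.
groups = [[0, 1], [2, 3], [4, 5]]
p_t = [0.40, 0.30, 0.12, 0.08, 0.06, 0.04]  # teacher skewed toward head classes
p_s = [0.25, 0.25, 0.15, 0.15, 0.10, 0.10]  # student

cross, within, g_t = decompose_kd(p_t, p_s, groups)

# Standard KD weights each within-group term by the teacher's group mass.
standard = cross + sum(w * k for w, k in zip(g_t, within))
balanced = equal_weight_loss(p_t, p_s, groups)
```

In this toy setup, `standard` reproduces the full KL divergence between teacher and student, and `g_t` shows how little weight the tail group receives under the unmodified objective.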