Trust the uncertain teacher: distilling dark knowledge via calibrated uncertainty

Jeonghyun Kim; SooKyung Kim; Richeng Xuan; Hyunsoo Cho

TL;DR

Calibrated Uncertainty Distillation (CUD) is a framework that makes dark knowledge more faithfully accessible by shaping the teacher's predictive distribution before transfer, so that students benefit from both confident signals on easy cases and structured uncertainty on hard ones.

Abstract

The core of knowledge distillation lies in transferring the teacher's rich 'dark knowledge': subtle probabilistic patterns that reveal how classes are related and how uncertainty is distributed. While this idea is well established, teachers trained with conventional cross-entropy often fail to preserve such signals. Their distributions collapse into sharp, overconfident peaks that appear decisive but are in fact brittle, offering little beyond the hard label or even subtly hindering representation-level transfer. This overconfidence is especially problematic in high-cardinality tasks, where the nuances among many plausible classes matter most for guiding a compact student. Moreover, such brittle targets reduce robustness under distribution shift, leaving students vulnerable to miscalibration in real-world conditions. To address this limitation, we revisit distillation from a distributional perspective and propose Calibrated Uncertainty Distillation (CUD), a framework designed to make dark knowledge more faithfully accessible. Instead of uncritically adopting the teacher's overconfidence, CUD encourages teachers to reveal uncertainty where it is informative and guides students to learn from calibrated targets rather than sharpened certainty. By directly shaping the teacher's predictive distribution before transfer, our approach balances accuracy and calibration, allowing students to benefit from both confident signals on easy cases and structured uncertainty on hard ones. Across diverse benchmarks, CUD yields students that are not only more accurate, but also better calibrated under shift and more reliable on ambiguous, long-tail inputs.

Paper Structure

This paper contains 52 sections, 1 theorem, 13 equations, 4 figures, 12 tables, 1 algorithm.

Introduction
Problem Formulation
Preliminaries and Limitations of Conventional KD
Guiding Principles for Calibrated Distillation
C1: Difficulty-aware uncertainty.
C2: Calibrated, selective imitation.
Calibration as a Constraint-based Reformulation
R1: Uncertainty with semantic support (for difficulty-aware uncertainty).
R2: Wrong-mass budget (for selective imitation).
Projection to Calibrated Targets
Calibrated Uncertainty Distillation
Difficulty-aware Uncertainty Shaping
W-Clip: Wrong-mass Clipping
Student Distillation with Calibrated Targets
Operational Realization of Constraints
...and 37 more sections

Key Result

Theorem 1

Let $\mathcal{Q}$ be the set of valid probability distributions on the simplex $\Delta^K$. The feasible set $\mathcal{C} = \{ q \in \mathcal{Q} \mid \text{R1}(q) \le \epsilon_1, \text{R2}(q) \le \epsilon_2 \}$ is defined by the intersection of linear half-spaces and the probability simplex, making ...
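The theorem statement above is cut off, but its premise (a feasible set formed by linear half-spaces intersected with the simplex) implies convexity, which is the usual ingredient for a unique projection. The following is a minimal numerical sketch of that convexity property, not the paper's proof. The constraint matrix `A`, the budget `eps`, and both helper functions are hypothetical stand-ins for R1 and R2, whose exact forms are not given here.

```python
import numpy as np

def is_feasible(q, A, eps, tol=1e-9):
    """Check that q lies on the simplex and satisfies A @ q <= eps.

    A and eps are hypothetical linear constraints standing in for R1/R2.
    """
    q = np.asarray(q, dtype=float)
    on_simplex = np.all(q >= -tol) and abs(q.sum() - 1.0) <= tol
    return bool(on_simplex and np.all(A @ q <= eps + tol))

def convex_combination_stays_feasible(q1, q2, A, eps, num=11):
    """Check that every sampled mixture of two feasible points is feasible."""
    q1 = np.asarray(q1, dtype=float)
    q2 = np.asarray(q2, dtype=float)
    lambdas = np.linspace(0.0, 1.0, num)
    return all(is_feasible(lam * q1 + (1.0 - lam) * q2, A, eps) for lam in lambdas)

# Assumed toy setup: 4 classes, class 0 is the true class, and the total
# probability on the wrong classes (1, 2, 3) may not exceed 0.4.
A = np.array([[0.0, 1.0, 1.0, 1.0]])
eps = np.array([0.4])
q1 = np.array([0.7, 0.1, 0.1, 0.1])
q2 = np.array([0.6, 0.4, 0.0, 0.0])
print(convex_combination_stays_feasible(q1, q2, A, eps))
```

Because every mixture of two feasible distributions remains feasible, a strictly convex objective minimized over this set has at most one minimizer, which is consistent with the theorem's title, "Uniqueness of Projection".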

Figures (4)

Figure 1: Illustration of the main framework. Left: Difficulty-aware Uncertainty Shaping (DUS) adjusts weights based on the correctness of the model’s predictions, reducing overconfidence and encouraging a more calibrated probability distribution. Right: Wrong-mass Clipping (W-Clip) modifies the probability assigned to incorrect classes when the model makes a wrong prediction, providing a more stable and reliable distribution for the student.
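To make the W-Clip description in Figure 1 concrete, here is a minimal sketch of one way a wrong-mass budget could be enforced. It is an assumption, not the authors' procedure: the function `clip_wrong_mass`, the `budget` parameter, and the choice to rescale wrong classes proportionally and give the freed mass to the true class are all hypothetical.

```python
import numpy as np

def clip_wrong_mass(probs, label, budget=0.5):
    """Cap the total probability on wrong classes when the teacher is wrong.

    Hypothetical rule: if the teacher's top class is not `label` and the
    mass on wrong classes exceeds `budget`, shrink the wrong classes
    proportionally to sum to `budget` and assign the rest to `label`.
    """
    probs = np.asarray(probs, dtype=float)
    wrong = np.ones(probs.shape, dtype=bool)
    wrong[label] = False
    wrong_mass = probs[wrong].sum()

    # Leave correct or already within-budget predictions untouched.
    if probs.argmax() == label or wrong_mass <= budget:
        return probs.copy()

    clipped = probs.copy()
    clipped[wrong] *= budget / wrong_mass  # keeps ratios among wrong classes
    clipped[label] = 1.0 - budget          # freed mass goes to the true class
    return clipped

# Teacher prefers class 1, but the true class is 0.
print(clip_wrong_mass([0.1, 0.6, 0.3], label=0, budget=0.5))
```

Proportional rescaling keeps the relative ordering among the wrong classes, so whatever inter-class similarity structure the teacher expressed is still visible to the student after clipping.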
Figure 2: Effect of calibration on uncertainty and distributions.
Figure 3: Effect of DUS on predictive calibration. Across datasets, DUS reduces overconfident predictions by reshaping the teacher distribution, leading to a controlled underconfident shift. Although this increases ECE due to lower confidence than accuracy, the behavior is safer than overconfidence and yields more reliable uncertainty estimates for downstream OOD detection and risk-sensitive settings.
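The Figure 3 caption notes that an underconfident shift still raises ECE, because ECE measures the absolute gap between confidence and accuracy regardless of sign. The sketch below uses the standard equal-width-bin ECE estimator together with a signed gap to show the distinction. The function names and toy numbers are illustrative assumptions, not the paper's evaluation code or results.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin-weighted average of |accuracy - mean confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Inner edges only, so bin indices fall in 0 .. n_bins - 1.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

def signed_confidence_gap(confidences, correct):
    """Mean confidence minus accuracy: positive is over-, negative underconfident."""
    return float(np.mean(confidences) - np.mean(correct))

correct = [1, 1, 1, 0]              # toy accuracy of 0.75
overconfident = [0.9, 0.9, 0.9, 0.9]
underconfident = [0.6, 0.6, 0.6, 0.6]

for name, conf in [("over", overconfident), ("under", underconfident)]:
    print(name,
          expected_calibration_error(conf, correct),
          signed_confidence_gap(conf, correct))
```

In this toy case both predictors have the same ECE, and only the signed gap reveals which one errs on the cautious side. This is why a rise in ECE alone does not distinguish a safer underconfident model from an overconfident one.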
Figure 4: Learning curves illustrating the convergence and generalization behavior of our method. Our approach demonstrates smoother optimization and stronger validation performance throughout training.

Theorems & Definitions (1)

Theorem 1: Uniqueness of Projection
