REDistill: Robust Estimator Distillation for Balancing Robustness and Efficiency

Ondrej Tybl, Lukas Neumann

TL;DR

REDistill frames knowledge distillation as a robust statistical estimation problem by replacing the KL-based loss with a power-divergence loss $D_{\lambda}$, downweighting unreliable teacher predictions while preserving informative logit structure. With $\lambda$ set to $\tfrac{2}{3}$, the method achieves a principled balance between efficiency and robustness, integrating seamlessly into existing KD pipelines using only logits. Empirical results on CIFAR-100 and ImageNet-1k show consistent gains across diverse teacher–student pairs, and REDistill combines synergistically with other distillation losses to reach state-of-the-art performance under model-agnostic and model-specific protocols. The approach requires no task-specific hyperparameter tuning and remains robust to data augmentation, highlighting its practical impact for broad KD applications.

Abstract

Knowledge Distillation (KD) transfers knowledge from a large teacher model to a smaller student by aligning their predictive distributions. However, conventional KD formulations, typically based on Kullback-Leibler divergence, assume that the teacher provides reliable soft targets. In practice, teacher predictions are often noisy or overconfident, and existing correction-based approaches rely on ad-hoc heuristics and extensive hyper-parameter tuning, which hinders generalization. We introduce REDistill (Robust Estimator Distillation), a simple yet principled framework grounded in robust statistics. REDistill replaces the standard KD objective with a power divergence loss, a generalization of KL divergence that adaptively downweights unreliable teacher outputs while preserving informative logit relationships. This formulation provides a unified and interpretable treatment of teacher noise, requires only logits, integrates seamlessly into existing KD pipelines, and incurs negligible computational overhead. Extensive experiments on CIFAR-100 and ImageNet-1k demonstrate that REDistill consistently improves student accuracy across diverse teacher-student architectures. Remarkably, it achieves these gains without model-specific hyper-parameter tuning, underscoring its robustness and strong generalization to unseen teacher-student pairs.
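
To make the idea of swapping the KL objective for a power divergence concrete, here is a minimal sketch in plain Python. It rests on assumptions not stated in the abstract: that the power divergence is the Cressie-Read family, that the teacher distribution is the first argument and the student the second, and that no temperature scaling is applied. The names softmax, power_divergence, and distill_loss are illustrative and do not come from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def power_divergence(p, q, lam=2.0 / 3.0, eps=1e-12):
    """Cressie-Read power divergence D_lam(p || q) (assumed form, lam >= 0).

    D_lam(p || q) = 1 / (lam * (lam + 1)) * sum_i p_i * ((p_i / q_i)**lam - 1)
    The limit lam -> 0 is the KL divergence and is handled explicitly.
    """
    pairs = [(pi, max(qi, eps)) for pi, qi in zip(p, q) if pi > 0]
    if abs(lam) < 1e-8:
        return sum(pi * math.log(pi / qi) for pi, qi in pairs)
    total = sum(pi * ((pi / qi) ** lam - 1.0) for pi, qi in pairs)
    return total / (lam * (lam + 1.0))


def distill_loss(teacher_logits, student_logits, lam=2.0 / 3.0):
    """Per-example distillation loss computed from logits only (hypothetical helper)."""
    p = softmax(teacher_logits)  # assumption: teacher is the reference distribution
    q = softmax(student_logits)
    return power_divergence(p, q, lam)


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.0]
    student = [0.0, 1.0, 2.0]
    print("lam=2/3:", distill_loss(teacher, student))
    print("lam=0 (KL):", distill_loss(teacher, student, lam=0.0))
```

Because the function takes only two logit vectors, a loss of this shape could in principle stand in wherever a KD pipeline currently computes a KL term, which is the integration property the abstract describes.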

Paper Structure

This paper contains 17 sections, 36 equations, 2 figures, and 6 tables.

Figures (2)

  • Figure 1: Knowledge distillation trains a smaller student model by matching its outputs to a larger teacher’s logits. However, teachers, even large ones, can be unreliable, leading to degraded student performance. Our method introduces a robust distillation loss that adapts the contribution of each training example based on how reliable the typical teacher’s prediction is for the given input. This enables the student to dynamically trust or distrust the teacher beyond simply checking whether the top logit is correct. Our approach is theoretically sound and simple to integrate with existing distillation methods.
  • Figure 2: The divergence $\operatorname{D}_\lambda$ corresponds to the $\operatorname{KL}$ divergence for $\lambda=0$. For other values, the logarithm used as a measure of surprise in the divergence computation is replaced by its smooth relaxation (known as the $(1-\lambda)$-logarithm, see Eq. \ref{eq:lambda_log}). (a) shows the graph of the $(1-\lambda)$-logarithm; (b) shows $\operatorname{D}_\lambda(P \| Q)$ between $P=\left(1/3,2/3\right)$ and $Q=\left(q, 1-q\right)$ as a function of $q$ for different $\lambda$ values.
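
The quantities in the Figure 2 caption can be reproduced numerically with a short sketch. It assumes that the $(1-\lambda)$-logarithm is the standard deformed logarithm $(x^{\lambda}-1)/\lambda$ and that $\operatorname{D}_\lambda$ carries a $1/(1+\lambda)$ normalization, as in the Cressie-Read family; the paper's own equation may differ in scaling. The function names deformed_log and d_lambda are hypothetical.

```python
import math


def deformed_log(x, lam):
    """Assumed (1 - lam)-logarithm: (x**lam - 1) / lam, tending to ln(x) as lam -> 0."""
    if abs(lam) < 1e-8:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def d_lambda(p, q, lam):
    """Assumed form: D_lam(P || Q) = 1 / (1 + lam) * sum_i p_i * log_{1-lam}(p_i / q_i)."""
    return sum(pi * deformed_log(pi / qi, lam) for pi, qi in zip(p, q)) / (1.0 + lam)


# The two-outcome example from the caption: P is fixed, Q = (q, 1 - q) varies.
P = (1.0 / 3.0, 2.0 / 3.0)
for lam in (0.0, 2.0 / 3.0, 1.0):
    row = [round(d_lambda(P, (q, 1.0 - q), lam), 4) for q in (0.1, 1.0 / 3.0, 0.6)]
    print("lam =", round(lam, 3), "->", row)
```

Under these assumptions the divergence vanishes at $q = 1/3$, where $Q$ equals $P$, for every $\lambda$, and setting $\lambda = 0$ recovers the ordinary KL value.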