REDistill: Robust Estimator Distillation for Balancing Robustness and Efficiency
Ondrej Tybl, Lukas Neumann
TL;DR
REDistill frames knowledge distillation as a robust statistical estimation problem by replacing the KL-based loss with a power-divergence loss $D_{\lambda}$, downweighting unreliable teacher predictions while preserving informative logit structure. With $\lambda$ set to $\tfrac{2}{3}$, the method achieves a principled balance between efficiency and robustness, integrating seamlessly into existing KD pipelines using only logits. Empirical results on CIFAR-100 and ImageNet-1k show consistent gains across diverse teacher–student pairs, and REDistill combines synergistically with other distillation losses to reach state-of-the-art performance under model-agnostic and model-specific protocols. The approach requires no task-specific hyperparameter tuning and remains robust to data augmentation, highlighting its practical impact for broad KD applications.
Abstract
Knowledge Distillation (KD) transfers knowledge from a large teacher model to a smaller student by aligning their predictive distributions. However, conventional KD formulations, typically based on Kullback-Leibler divergence, assume that the teacher provides reliable soft targets. In practice, teacher predictions are often noisy or overconfident, and existing correction-based approaches rely on ad-hoc heuristics and extensive hyper-parameter tuning, which hinders generalization. We introduce REDistill (Robust Estimator Distillation), a simple yet principled framework grounded in robust statistics. REDistill replaces the standard KD objective with a power divergence loss, a generalization of KL divergence that adaptively downweights unreliable teacher outputs while preserving informative logit relationships. This formulation provides a unified and interpretable treatment of teacher noise, requires only logits, integrates seamlessly into existing KD pipelines, and incurs negligible computational overhead. Extensive experiments on CIFAR-100 and ImageNet-1k demonstrate that REDistill consistently improves student accuracy across diverse teacher-student architectures. Remarkably, it achieves these gains without model-specific hyper-parameter tuning, underscoring its robustness and strong generalization to unseen teacher-student pairs.
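
To make the idea of a power divergence loss on logits concrete, the following is a minimal sketch and not the authors' implementation. It assumes the Cressie-Read form of the power divergence, $D_{\lambda}(p \,\|\, q) = \frac{1}{\lambda(\lambda+1)} \sum_i p_i \left[ (p_i / q_i)^{\lambda} - 1 \right]$, which recovers the KL divergence as $\lambda \to 0$. The text above does not state the exact formula or the argument order, so placing the teacher distribution in the first argument is also an assumption, as are the helper names `softmax`, `power_divergence`, and `distill_loss` and the `temperature` parameter.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert a list of logits into probabilities (numerically stable)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def power_divergence(p, q, lam=2.0 / 3.0, eps=1e-12):
    """Cressie-Read power divergence D_lam(p || q) (assumed form).

    For lam close to 0 this falls back to the KL divergence KL(p || q),
    which is the limit of the power divergence as lam -> 0.
    """
    if abs(lam) < 1e-8:
        return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    total = sum(
        pi * (((pi + eps) / (qi + eps)) ** lam - 1.0) for pi, qi in zip(p, q)
    )
    return total / (lam * (lam + 1.0))


def distill_loss(teacher_logits, student_logits, lam=2.0 / 3.0, temperature=1.0):
    """Hypothetical logit-only distillation loss for a single example.

    Assumption: the teacher distribution is the first argument of D_lam.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return power_divergence(p, q, lam)
```

In a training loop, a quantity of this kind would take the place of the KL term of a standard KD objective, averaged over the batch; only teacher and student logits are needed, which is consistent with the logit-only claim above.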
