Uncertainty-Aware Dual-Student Knowledge Distillation for Efficient Image Classification
Aakash Gore, Anoushka Dey, Aryan Mishra
TL;DR
This work tackles efficient image classification by improving knowledge distillation through an uncertainty-aware dual-student framework. A frozen teacher provides guidance whose influence is modulated by its per-sample predictive entropy $H(x)$ via a weight $w(x) = 1 - \frac{H(x)}{\log C}$, where $C$ is the number of classes, so confident teacher predictions count more and near-uniform ones count less. Two heterogeneous students (ResNet-18 and MobileNetV2) learn simultaneously from hard labels, the weighted teacher soft targets, and each other through a peer-loss term. On ImageNet-100, ResNet-18 reaches $83.84\%$ top-1 accuracy and MobileNetV2 reaches $81.46\%$, outperforming traditional KD by $2.04$ and $0.92$ percentage points respectively, at compression ratios of $2.19\times$ and $7.31\times$ relative to the teacher. Ablation studies and uncertainty analyses support the claim that combining entropy-based weighting with cross-architecture peer learning yields robust improvements and practical deployment benefits for edge-friendly models.
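As a rough illustration of the weighting described above, the following is a minimal sketch in plain Python, not the authors' implementation. Only the formula $w(x) = 1 - \frac{H(x)}{\log C}$ and the three loss sources (hard labels, weighted teacher soft targets, peer term) come from the text. The function names, the KL-divergence form of the teacher and peer terms, the coefficients `alpha` and `beta`, the `temperature` value, the $T^2$ scaling, and the choice to compute the entropy from the untempered teacher distribution are all assumptions made for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy H = -sum p log p (natural log); zero entries are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def uncertainty_weight(teacher_probs):
    """w(x) = 1 - H(x) / log C, as given in the text; lies in [0, 1]."""
    num_classes = len(teacher_probs)
    return 1.0 - entropy(teacher_probs) / math.log(num_classes)


def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) between two discrete distributions."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, pred)
        if t > 0.0
    )


def student_loss(student_logits, peer_logits, teacher_logits, label,
                 alpha=0.5, beta=0.1, temperature=4.0):
    """Per-sample loss for one student.

    ASSUMPTION: alpha, beta, temperature, the KL form of both soft terms,
    and the T^2 scaling are hypothetical choices, not taken from the paper.
    """
    # Hard-label cross-entropy on the student's untempered prediction.
    ce = -math.log(softmax(student_logits)[label] + 1e-12)

    # Entropy weight from the teacher's untempered prediction (assumption).
    w = uncertainty_weight(softmax(teacher_logits))

    # Tempered distributions for the two soft-target terms.
    s_soft = softmax(student_logits, temperature)
    t_soft = softmax(teacher_logits, temperature)
    p_soft = softmax(peer_logits, temperature)

    kd = kl_divergence(t_soft, s_soft) * temperature ** 2
    peer = kl_divergence(p_soft, s_soft) * temperature ** 2
    return ce + alpha * w * kd + beta * peer
```

Under this sketch, a teacher that outputs a uniform distribution gets weight 0, so its soft targets drop out of the loss, while a one-hot teacher prediction gets weight 1. In a dual-student setup the same function would be evaluated once per student, with the other student's logits passed as `peer_logits`.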
Abstract
Knowledge distillation has emerged as a powerful technique for model compression, enabling the transfer of knowledge from large teacher networks to compact student models. However, traditional knowledge distillation methods treat all teacher predictions equally, regardless of the teacher's confidence in those predictions. This paper proposes an uncertainty-aware dual-student knowledge distillation framework that uses teacher prediction uncertainty to selectively guide student learning. We introduce a peer-learning mechanism in which two heterogeneous student architectures, ResNet-18 and MobileNetV2, learn collaboratively from both the teacher network and each other. Experimental results on ImageNet-100 show that our approach outperforms baseline knowledge distillation methods: ResNet-18 achieves 83.84\% top-1 accuracy and MobileNetV2 achieves 81.46\% top-1 accuracy, improvements of 2.04 and 0.92 percentage points respectively over traditional single-student distillation.
