Uncertainty-Aware Dual-Student Knowledge Distillation for Efficient Image Classification

Aakash Gore, Anoushka Dey, Aryan Mishra

TL;DR

This work tackles efficient image classification by improving knowledge distillation through an uncertainty-aware dual-student framework. A frozen teacher provides guidance whose influence is modulated by per-sample predictive entropy via a weight $w(x) = 1 - \frac{H(x)}{\log C}$, where $H(x)$ is the entropy of the teacher's prediction for sample $x$ and $C$ is the number of classes. Two heterogeneous students (ResNet-18 and MobileNetV2) learn simultaneously from hard labels, the weighted teacher soft targets, and each other through a peer-loss term. On ImageNet-100, ResNet-18 reaches $83.84\%$ top-1 accuracy and MobileNetV2 reaches $81.46\%$, outperforming traditional KD by $2.04$ and $0.92$ points respectively, at compression ratios of $2.19\times$ and $7.31\times$ relative to the teacher. Ablation studies and uncertainty analyses indicate that combining entropy-based weighting with cross-architecture peer learning yields robust improvements and practical deployment benefits for edge-friendly models.
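The entropy-based weight in the TL;DR can be illustrated with a minimal Python sketch. It assumes that $H(x)$ is the Shannon entropy (natural log) of the teacher's softmax output at temperature 1; the function names `softmax`, `entropy`, and `uncertainty_weight` are hypothetical and not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def uncertainty_weight(teacher_logits):
    """w(x) = 1 - H(x) / log(C), with C the number of classes.

    Assumption: H(x) is computed from the teacher's softmax at temperature 1.
    The weight is close to 1 for a confident teacher and close to 0 when the
    teacher's prediction is near uniform.
    """
    probs = softmax(teacher_logits)
    num_classes = len(probs)
    return 1.0 - entropy(probs) / math.log(num_classes)
```

Under these assumptions, a uniform teacher prediction over 100 classes gives a weight of approximately 0, while a sharply peaked prediction gives a weight close to 1, so low-confidence teacher outputs contribute less to the distillation signal.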

Abstract

Knowledge distillation has emerged as a powerful technique for model compression, enabling the transfer of knowledge from large teacher networks to compact student models. However, traditional knowledge distillation methods treat all teacher predictions equally, regardless of the teacher's confidence in those predictions. This paper proposes an uncertainty-aware dual-student knowledge distillation framework that leverages teacher prediction uncertainty to selectively guide student learning. We introduce a peer-learning mechanism where two heterogeneous student architectures, specifically ResNet-18 and MobileNetV2, learn collaboratively from both the teacher network and each other. Experimental results on ImageNet-100 demonstrate that our approach outperforms baseline knowledge distillation methods, with ResNet-18 achieving 83.84% top-1 accuracy and MobileNetV2 achieving 81.46% top-1 accuracy, improvements of 2.04 and 0.92 percentage points respectively over traditional single-student distillation approaches.

Paper Structure

This paper contains 30 sections, 6 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Traditional knowledge distillation framework with ResNet-18 student. The teacher network provides temperature-scaled soft labels while ground truth provides hard labels for training.
  • Figure 2: Traditional knowledge distillation framework with MobileNetV2 student. This baseline approach uses the same loss configuration as Experiment 1 but with a more compact student architecture.
  • Figure 3: Proposed uncertainty-aware dual-student knowledge distillation framework. The teacher generates uncertainty weights based on prediction entropy, which modulate the soft label guidance. Both students learn from weighted teacher predictions, ground truth labels, and each other through bidirectional peer learning.
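The three training signals named in the Figure 3 caption (ground-truth labels, uncertainty-weighted teacher soft labels, and peer predictions) can be combined per sample roughly as in the sketch below. This is an illustration under stated assumptions, not the paper's loss: the coefficients `alpha` and `beta`, the temperature value, the use of KL divergence for both the teacher and peer terms, and computing the entropy weight at temperature 1 are all hypothetical choices.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable, temperature-scaled softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats, with a small epsilon for numerical safety."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def student_loss(student_logits, peer_logits, teacher_logits, label,
                 temperature=4.0, alpha=0.5, beta=0.1):
    """Per-sample loss for one student (hypothetical weighting scheme)."""
    # Hard-label cross-entropy on the student's own prediction.
    ce = -math.log(softmax(student_logits)[label] + 1e-12)

    # Entropy-based weight w(x) = 1 - H(x) / log(C) from the teacher output.
    t_probs = softmax(teacher_logits)
    h = -sum(p * math.log(p) for p in t_probs if p > 0.0)
    w = 1.0 - h / math.log(len(t_probs))

    # Temperature-scaled distillation term, scaled by T^2 as in standard KD.
    kd = (temperature ** 2) * kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature))

    # Peer term: pull this student toward the other student's prediction.
    peer = kl_divergence(softmax(peer_logits), softmax(student_logits))

    return ce + alpha * w * kd + beta * peer
```

In a bidirectional setup, the same function would be evaluated twice per sample with the student and peer roles swapped, so each student receives its own loss. When the teacher's prediction is uniform, the weight is approximately 0 and the teacher term effectively drops out, leaving only the hard-label and peer terms.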