BD-KD: Balancing the Divergences for Online Knowledge Distillation

Ibtihel Amara, Nazanin Sepahvand, Brett H. Meyer, Warren J. Gross, James J. Clark

TL;DR

BD-KD targets compact, well-calibrated models for edge devices by balancing divergences in online knowledge distillation. It replaces the standard student objective with a dual-KL objective that combines the forward KL, $\mathrm{KL}(p^t \| p^s)$, and the reverse KL, $\mathrm{KL}(p^s \| p^t)$, with sample-wise weights guided by the entropy gap between teacher and student predictions; the teacher objective uses the reverse KL as a regularizer. This student-centered approach improves both accuracy and calibration without post-hoc recalibration, as shown on CIFAR-10/100, TinyImageNet, and ImageNet across multiple architectures and teacher-student configurations. Reported gains include lower expected calibration error (ECE) and accuracy that is competitive with or better than online and offline KD baselines, along with mitigation of the capacity-gap issue and support for multi-network extensions. The method is modular and adapts to various KD losses and tasks, which makes it practical for deploying reliable, compact models on resource-constrained devices.
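As a rough illustration of the quantities named in the summary, the sketch below computes the per-sample forward KL, reverse KL, and teacher-student entropy gap from raw logits, then mixes the two divergences with sample-wise weights. The function names, the `delta_hi`/`delta_lo` values, and in particular the weighting rule are assumptions made for this example; the paper's actual student and teacher objectives are not reproduced on this page, so this should not be read as the authors' implementation.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    # Per-sample KL(p || q), summed over classes.
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def entropy(p, eps=1e-12):
    # Per-sample Shannon entropy of a predictive distribution.
    return -np.sum(p * np.log(p + eps), axis=-1)

def balanced_kl_student_loss(teacher_logits, student_logits,
                             delta_hi=2.0, delta_lo=1.0):
    p_t = softmax(teacher_logits)  # treated as a constant target (no gradient)
    p_s = softmax(student_logits)

    forward = kl(p_t, p_s)         # KL(p^t || p^s)
    reverse = kl(p_s, p_t)         # KL(p^s || p^t)
    gap = entropy(p_s) - entropy(p_t)

    # ASSUMPTION: a hypothetical placeholder rule, not the paper's.
    # Samples where the student is less certain than the teacher
    # (gap > 0) up-weight the forward term; the rest up-weight the
    # reverse term.
    w_fwd = np.where(gap > 0, delta_hi, delta_lo)
    w_rev = np.where(gap > 0, delta_lo, delta_hi)

    per_sample = w_fwd * forward + w_rev * reverse
    return per_sample.mean(), forward, reverse, gap
```

The point of the sketch is only the structure: both divergences are evaluated per sample, and the entropy gap selects how much each one contributes for that sample. In a real training loop the same computation would be written in an autodiff framework so that gradients flow only through the student's distribution.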

Abstract

We address the challenge of producing trustworthy and accurate compact models for edge devices. While Knowledge Distillation (KD) has improved model compression in terms of accuracy, the calibration of these compact models has been overlooked. We introduce BD-KD (Balanced Divergence Knowledge Distillation), a framework for logit-based online KD. BD-KD enhances both accuracy and model calibration simultaneously, eliminating the need for post-hoc recalibration techniques, which add computational overhead to the overall training pipeline and degrade performance. Our method encourages student-centered training by adjusting the conventional online distillation loss on both the student and teacher losses, employing sample-wise weighting of forward and reverse Kullback-Leibler divergence. This strategy balances student network confidence and boosts performance. Experiments across the CIFAR10, CIFAR100, TinyImageNet, and ImageNet datasets and various architectures demonstrate improved calibration and accuracy compared to recent online KD methods.
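Calibration in this setting is usually summarized by the expected calibration error (ECE): predictions are binned by confidence, and the gap between accuracy and mean confidence in each bin is averaged with weights proportional to bin size. The sketch below shows that standard equal-width binning estimator. The function name and the default of 15 bins are assumptions for illustration, and the binning scheme used in the paper's experiments is not stated on this page.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    # probs: (N, C) predicted class probabilities; labels: (N,) integer classes.
    confidences = probs.max(axis=1)          # top-1 confidence per sample
    predictions = probs.argmax(axis=1)       # top-1 predicted class
    correct = (predictions == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # |accuracy - mean confidence| in this bin, weighted by
            # the fraction of samples that fall into the bin.
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)
```

A model that is always fully confident but right only half the time would score an ECE of 0.5 under this estimator, while a model whose confidence matches its accuracy in every bin scores 0.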

Paper Structure (20 sections, 15 equations, 10 figures, 8 tables, 1 algorithm)

Figures (10)

  • Figure 1: Distillation losses in the proposed framework. (A) depicts the conventional online distillation loss (DML). (B) depicts the proposed student-centered distillation loss. The feedback signal from teacher to student uses a sample-wise weighting of both forward and reverse KL. We also exploit the reverse KL for teacher training. The parameters in red font are detached (gradient flow is stopped) during training.
  • Figure 2: (a) Capacity-gap curve: student (ResNet20) distilled from teachers of different capacity (WRN-16-2 to WRN-16-8) on CIFAR100. (b) t-SNE visualization (van der Maaten and Hinton, 2008) of the penultimate feature layer of the student model (WRN-16-1) trained with SwitOKD, on the test images from CIFAR10. (c) t-SNE visualization of the penultimate feature layer of the student model (WRN-16-1) trained with BD-KD, on the test images from CIFAR10.
  • Figure 3: Calibration curves (student ResNet20, teacher WRN-16-8). Online KD methods improve calibration compared to vanilla KD, and among the tested online methods BD-KD improves calibration the most.
  • Figure 4: (a) Accuracy gap between teacher (ResNet32x4) and student (ShuffleNetV2) networks on the test set of the CIFAR-100 dataset using BD-KD and SwitOKD. (b) Accuracy gap between teacher (ResNet50) and student (MBV1) networks on the validation set of the ImageNet dataset using BD-KD and SwitOKD. (c) Evolution of $H_t$ and $H_s$ during training of student (MobileNet V2) and teacher (WRN-16-2) on CIFAR100.
  • Figure 5: Student WRN-16-2 distilled from teacher WRN-40-2 on CIFAR100. (a) Student network trained without weighting (i.e., $\delta_1 = 1$ and $\delta_2 = 1$). (b) Student network trained using our proposed sample-wise weighting of the KL terms according to Eq. \ref{eq:teacher_obj} and Eq. \ref{eq:student_obj}.
  • ...and 5 more figures