ERDE: Entropy-Regularized Distillation for Early-exit

Martial Guidez, Stefan Duffner, Yannick Alpou, Oscar Röth, Christophe Garcia

TL;DR

The paper addresses the high computational cost of CNNs for image classification in real-time and edge scenarios. It proposes ERDE, a method that combines early-exit dynamic architectures with knowledge distillation (KD) and introduces an entropy-based loss $\mathcal{L}_{\text{E}}$ to handle cases where the teacher is uncertain or incorrect, enabling training across multiple exits. Empirical results on CIFAR10, CIFAR100, and SVHN show substantial compute reductions (up to about a 10x reduction in MACs) with negligible accuracy loss or even improved accuracy, outperforming standard KD across all tested exit thresholds. The approach yields runtime-configurable models that balance accuracy and efficiency, paving the way for broader use of knowledge distillation in dynamic, edge-friendly architectures.

Abstract

Although deep neural networks, and in particular Convolutional Neural Networks, have demonstrated state-of-the-art performance in image classification with relatively high efficiency, they still exhibit high computational costs, often rendering them impractical for real-time and edge applications. Therefore, a multitude of compression techniques have been developed to reduce these costs while maintaining accuracy. In addition, dynamic architectures have been introduced to modulate the level of compression at execution time, which is a desirable property in many resource-limited application scenarios. The proposed method integrates two well-established optimization techniques, early exits and knowledge distillation: a reduced student early-exit model is trained from a more complex teacher early-exit model. The primary contribution of this research lies in the approach for training the student early-exit model. In comparison to the conventional Knowledge Distillation loss, our approach incorporates a new entropy-based loss for images where the teacher's classification was incorrect. The proposed method optimizes the trade-off between accuracy and efficiency, thereby achieving significant reductions in computational complexity without compromising classification performance. The validity of this approach is substantiated by experimental results on the image classification datasets CIFAR10, CIFAR100 and SVHN, and it opens new research perspectives for Knowledge Distillation in other contexts.
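The training idea in the abstract can be illustrated with a minimal sketch of a per-exit loss that switches between a KD term and an entropy term depending on whether the teacher classified the image correctly. The abstract does not give the formula for the entropy-based loss, so everything specific here is an assumption: the function names, the weights `alpha` and `beta`, the temperature, keeping the cross-entropy term in both branches, and above all the direction of the entropy term (assumed here to reward a high-entropy student output when the teacher is wrong). It should be read as one plausible instance, not as the authors' definition.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(student_logits, label):
    """Standard cross-entropy against the ground-truth label."""
    return -math.log(softmax(student_logits)[label])

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Conventional KD term: KL(teacher || student) on softened outputs, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def exit_loss(student_logits, teacher_logits, label, alpha=0.5, beta=0.1):
    """Loss at one exit for one image (hypothetical weighting, see lead-in)."""
    ce = cross_entropy(student_logits, label)
    teacher_pred = max(range(len(teacher_logits)), key=lambda k: teacher_logits[k])
    if teacher_pred == label:
        # Teacher is right: distill its softened output into the student.
        return ce + alpha * kd_loss(student_logits, teacher_logits)
    # Teacher is wrong: skip distillation and apply an entropy term instead.
    # ASSUMPTION: the term penalizes low student entropy, normalized to [0, 1].
    normalized_entropy = entropy(softmax(student_logits)) / math.log(len(student_logits))
    return ce + beta * (1.0 - normalized_entropy)
```

In a multi-exit setting, a loss of this kind would be evaluated at every exit, pairing each student exit with a teacher exit, and the per-exit values summed; how the exits are paired and weighted is not stated in the text above.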

Paper Structure

This paper contains 11 sections, 5 equations, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: Our ERDE architecture and training approach for 4 teacher and student blocks (in blue and green respectively). The $\mathcal{L}^i$ correspond to the different loss functions ($\mathcal{L}_{\text{CE}}$, $\mathcal{L}_{\text{KD}}$, and $\mathcal{L}_{\text{E}}$) at the $i$-th exit. At inference, only the compressed student model is used.
  • Figure 2: Test accuracy as a function of average MACs for different datasets and training strategies using ResNet34 as teacher and ResNet10 as student.
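The accuracy versus average MACs trade-off referred to in Figure 2 can be illustrated with a small sketch of threshold-based early-exit inference. The exit criterion is not specified in the text above, so the rule used here (stop at the first exit whose maximum softmax probability reaches a threshold) is an assumption, as are the function names and the per-exit MAC counts in the usage example.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_predict(exit_logits, threshold):
    """Return (predicted_class, exit_index) for one image.

    exit_logits lists the logits of each exit from shallowest to deepest.
    They are precomputed here for illustration only; a real model would run
    the later blocks only if the earlier exits were not confident enough.
    ASSUMPTION: confidence is the maximum softmax probability.
    """
    last = len(exit_logits) - 1
    for i, logits in enumerate(exit_logits):
        probs = softmax(logits)
        confidence = max(probs)
        if confidence >= threshold or i == last:
            return probs.index(confidence), i

def average_macs(exit_indices, cumulative_macs):
    """Mean cost per image, given the exit each image left at and the
    cumulative MACs needed to reach each exit."""
    return sum(cumulative_macs[i] for i in exit_indices) / len(exit_indices)

# Usage with two images, two exits, and hypothetical costs (in millions of MACs).
images = [
    [[4.0, 0.0, 0.0], [6.0, 0.0, 0.0]],  # confident at the first exit
    [[0.1, 0.0, 0.0], [0.0, 5.0, 0.0]],  # only confident at the second exit
]
costs = [10.0, 40.0]
exits = [early_exit_predict(img, threshold=0.9)[1] for img in images]
print(exits, average_macs(exits, costs))
```

Raising the threshold pushes more images to deeper exits, which increases the average MACs; sweeping it at inference time is one way to trace out a curve of accuracy against cost without retraining.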