Knowledge Distillation: Enhancing Neural Network Compression with Integrated Gradients

David E. Hernandez, Jose Ramon Chang, Torbjörn E. M. Nordling

TL;DR

The paper addresses the challenge of deploying accurate neural networks on resource-constrained devices by introducing a novel IG-guided Knowledge Distillation framework. It augments traditional KD with precomputed Integrated Gradients overlays used as data augmentation, enabling a significantly compressed student (4.1x reduction) to achieve near-teacher performance on CIFAR-10 while delivering substantial latency improvements. Empirical results show KD, IG augmentation, and their combination yield progressive gains, with KD+IG achieving 92.45% accuracy and a 10.8x decrease in inference time compared to the teacher. This work contributes a scalable, interpretable compression technique suitable for edge computing, and it provides systematic ablations and hyperparameter guidelines to reproduce and extend the approach.

Abstract

Efficient deployment of deep neural networks on resource-constrained devices demands advanced compression techniques that preserve accuracy and interpretability. This paper proposes a machine learning framework that augments Knowledge Distillation (KD) with Integrated Gradients (IG), an attribution method, to optimise the compression of convolutional neural networks. We introduce a novel data augmentation strategy where IG maps, precomputed from a teacher model, are overlaid onto training images to guide a compact student model toward critical feature representations. This approach leverages the teacher's decision-making insights, enhancing the student's ability to replicate complex patterns with reduced parameters. Experiments on CIFAR-10 demonstrate the efficacy of our method: a student model, compressed 4.1-fold from the MobileNet-V2 teacher, achieves 92.5% classification accuracy, surpassing the baseline student's 91.4% and traditional KD approaches, while reducing inference latency from 140 ms to 13 ms, a roughly tenfold speedup. We perform hyperparameter optimisation for efficient learning. Comprehensive ablation studies dissect the contributions of KD and IG, revealing synergistic effects that boost both performance and model explainability. Our method's emphasis on feature-level guidance via IG distinguishes it from conventional KD, offering a data-driven solution for mining transferable knowledge in neural architectures. This work contributes to machine learning by providing a scalable, interpretable compression technique, ideal for edge computing applications where efficiency and transparency are paramount.
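
The IG overlay step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the midpoint Riemann approximation, the max-normalisation of attribution magnitudes, and the blending weight `alpha` are all assumptions made for illustration, and images are treated as flat lists of pixel values in [0, 1] to keep the example dependency-free. In practice `grad_fn` would return the gradient of the teacher's class score with respect to the input.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate IG along the straight path from baseline to x.

    grad_fn maps an input (list of floats) to the gradient of the model
    output with respect to that input (hypothetical callable).
    """
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        a = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + a * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_fn(point)
        for i in range(n):
            avg_grad[i] += g[i] / steps
    # IG_i = (x_i - baseline_i) * average gradient along the path
    return [(xi - b) * gi for xi, b, gi in zip(x, baseline, avg_grad)]


def overlay(image, attributions, alpha=0.5):
    """Blend an image with its max-normalised attribution magnitudes.

    The linear blend is an assumed overlay rule, not the paper's formula.
    """
    mags = [abs(a) for a in attributions]
    peak = max(mags) or 1.0  # avoid division by zero for all-zero maps
    return [(1 - alpha) * p + alpha * m / peak for p, m in zip(image, mags)]


# Toy usage: f(x) = sum(x_i^2) stands in for the teacher, so grad f = 2x.
toy_grad = lambda p: [2.0 * v for v in p]
image = [0.5, 1.0, 0.0]
ig_map = integrated_gradients(toy_grad, image, baseline=[0.0, 0.0, 0.0])
augmented = overlay(image, ig_map, alpha=0.5)
```

Because the maps depend only on the teacher, they can be computed once before student training and reused in every epoch, which is what makes precomputation attractive.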

Paper Structure

This paper contains 18 sections, 5 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Relationship between compression factor and accuracy across different knowledge distillation approaches on CIFAR-10. Each line connects a teacher model (left point) to its compressed student counterpart (right point). Our IG-enhanced KD approach (in black) achieves competitive accuracy at 4.1x compression, balancing efficiency and performance. The blue dashed line represents the mean accuracy (91.51%) across studies, while the red dashed line indicates the mean compression factor (9.57x).
  • Figure 2: Knowledge distillation process using integrated gradients for data augmentation. The teacher model (green) employs a temperature parameter $T=\tau$ where $\tau > 1$ in its softmax function to produce soft targets, which, along with the hard labels from the dataset, guide the training of the student model (blue). Integrated gradients (brown) are overlaid onto the original images to generate augmented data that focuses on the critical features the student model should use during training.
  • Figure 3: Implementation of IG as a data augmentation technique on CIFAR-10. The top row shows original images from various classes. The middle row displays the Integrated Gradients, highlighting areas significantly influencing the predictions of the teacher model. The bottom row presents overlaid images, combining originals with their respective integrated gradients to emphasise regions of interest.
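
The soft-target and hard-label training signal described for Figure 2 can be sketched as a single per-example loss. This is a minimal sketch of the standard temperature-scaled distillation loss, not necessarily the exact objective used in the paper: the name `kd_loss`, the weighting `alpha`, the default `temperature=4.0`, and the $T^2$ scaling of the soft term are assumptions chosen for illustration.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_total for s in scaled]


def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-target KL term and a hard-label cross-entropy.

    alpha and temperature are illustrative defaults (assumptions),
    not values reported by the paper.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    # KL(teacher || student) on the temperature-softened distributions
    kl = sum(math.exp(lt) * (lt - ls)
             for lt, ls in zip(log_p_teacher, log_p_student))
    # Cross-entropy against the hard label at temperature 1
    ce = -log_softmax(student_logits)[label]
    # T^2 keeps the soft-term gradients on a comparable scale as T grows
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


# Toy usage: a student that agrees with the teacher incurs a lower loss.
agree = kd_loss([3.0, 0.0], [3.0, 0.0], label=0)
disagree = kd_loss([0.0, 3.0], [3.0, 0.0], label=0)
```

With IG augmentation, the same loss would be evaluated on the overlaid images instead of (or alongside) the originals; only the inputs change, not the objective.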