
InDistill: Information flow-preserving knowledge distillation for model compression

Ioannis Sarridis, Christos Koutlis, Giorgos Kordopatis-Zilos, Ioannis Kompatsiaris, Symeon Papadopoulos

TL;DR

InDistill introduces a curriculum-learning warmup that preserves the teacher’s information flow paths to improve subsequent knowledge distillation. It handles width disparities with channel pruning and addresses extreme capacity gaps with an auxiliary teacher, enabling direct feature-map transfer. Across CIFAR-10/100 and ImageNet, InDistill consistently boosts multiple KD baselines, improves retrieval representations, and reduces variability, demonstrating practical utility for wide-ranging model compression scenarios. The approach provides a principled, plug-in enhancement to KD that emphasizes information flow, layer-wise difficulty, and architectural alignment to achieve stronger student performance.

Abstract

In this paper, we introduce InDistill, a method that serves as a warmup stage for enhancing Knowledge Distillation (KD) effectiveness. InDistill focuses on transferring critical information flow paths from a heavyweight teacher to a lightweight student. This is achieved via a training scheme based on curriculum learning that considers the distillation difficulty of each layer and the critical learning periods when the information flow paths are established. This procedure can lead to a student model that is better prepared to learn from the teacher. To ensure the applicability of InDistill across a wide range of teacher-student pairs, we also incorporate a pruning operation when there is a discrepancy in the width of the teacher and student layers. This pruning operation reduces the width of the teacher's intermediate layers to match that of the student's, allowing direct distillation without the need for an encoding stage. The proposed method is extensively evaluated using various pairs of teacher-student architectures on the CIFAR-10, CIFAR-100, and ImageNet datasets, demonstrating that preserving the information flow paths consistently increases the performance of the baseline KD approaches in both classification and retrieval settings. The code is available at https://github.com/gsarridis/InDistill.
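The width-matching idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the per-channel scores, the "keep the highest-scoring channels" rule, and the helper names `select_channels`, `prune_channels`, and `feature_mse` are all assumptions made for illustration, and MSE is used only as a simple stand-in for a feature-matching loss. Feature maps are represented as lists of flattened channels to keep the example dependency-free.

```python
from typing import List, Sequence


def select_channels(channel_scores: Sequence[float], student_width: int) -> List[int]:
    """Pick the `student_width` highest-scoring teacher channels.

    The scoring criterion is an assumption (e.g. a filter norm); the kept
    indices are returned in their original order.
    """
    if not 0 < student_width <= len(channel_scores):
        raise ValueError("student_width must be between 1 and the teacher width")
    ranked = sorted(range(len(channel_scores)),
                    key=lambda i: channel_scores[i], reverse=True)
    return sorted(ranked[:student_width])


def prune_channels(feature_map: Sequence[Sequence[float]],
                   keep: Sequence[int]) -> List[List[float]]:
    """Keep only the selected channels of a (channels x values) feature map."""
    return [list(feature_map[i]) for i in keep]


def feature_mse(teacher: Sequence[Sequence[float]],
                student: Sequence[Sequence[float]]) -> float:
    """Mean squared error between two feature maps of identical shape."""
    if len(teacher) != len(student):
        raise ValueError("feature maps must have the same number of channels")
    diffs = [(t - s) ** 2
             for t_ch, s_ch in zip(teacher, student)
             for t, s in zip(t_ch, s_ch)]
    return sum(diffs) / len(diffs)


# Toy example: a 4-channel teacher layer matched to a 2-channel student layer.
teacher_scores = [0.5, 2.0, 0.1, 1.5]          # hypothetical channel scores
teacher_map = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
student_map = [[2.0, 2.0], [4.0, 2.0]]

keep = select_channels(teacher_scores, student_width=len(student_map))
pruned_teacher_map = prune_channels(teacher_map, keep)
loss = feature_mse(pruned_teacher_map, student_map)
```

Because the pruned teacher map and the student map have the same number of channels, they can be compared directly, with no learned encoder between them to change the width.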

Paper Structure (18 sections, 4 equations, 5 figures, 7 tables, 1 algorithm)

Figures (5)

  • Figure 1: Overview of the proposed InDistill method. In stage 1, InDistill acts as a warmup, transferring the teacher's critical information flow via curriculum learning and direct distillation through pruning. This enhances the effectiveness of the main knowledge distillation process in stage 2.
  • Figure 2: $\mathcal{L}_{MSE}$ loss curves. KD is applied from a teacher to a 3-layer student using the PKT loss [passalis2018learning] on the CIFAR-10 dataset. The MSE loss is computed on the feature maps of images from the validation set.
  • Figure 3: Channel pruning and curriculum learning visualization. In cases of a large capacity gap, channel pruning is applied to the teacher's intermediate layers so that the teacher and student layer widths coincide for $l \in (1, L_g-1)$. Then, curriculum learning is applied, where each layer $l$ is trained for $\mathcal{S}_l$ epochs. After applying InDistill, any knowledge distillation loss can be employed as $\mathcal{L}_{KD}$ along with the task loss $\mathcal{L}_{task}$.
  • Figure 4: Evaluation on information flow preservation. $\mathcal{L}_{MI}$ loss curves during KD on CIFAR-10. Comparison of InDistill with the proposed curriculum learning, with WD [passalis2020heterogeneous], or without curriculum learning. PKT is used as the main KD approach.
  • Figure 5: Impact of curriculum parameters. Student performance for different values of $a$ and $b$ on CIFAR-10. PKT is used as the main KD approach.
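
The curriculum described in the Figure 3 caption, where each layer $l$ is trained for $\mathcal{S}_l$ epochs, can be sketched as a lookup from the warmup epoch to the layer currently being distilled. This is a minimal illustration under stated assumptions: layers are visited one at a time from the first to the last, the per-layer epoch counts are supplied directly (the way they are derived from $a$ and $b$ is not given in this overview), and `layer_for_epoch` is a hypothetical helper name.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import Sequence


def layer_for_epoch(epochs_per_layer: Sequence[int], epoch: int) -> int:
    """Return the 1-based index of the layer distilled at a 0-based epoch.

    epochs_per_layer[l - 1] plays the role of S_l; layers are assumed to be
    trained sequentially, which is an assumption of this sketch.
    """
    boundaries = list(accumulate(epochs_per_layer))  # cumulative end epochs
    if not boundaries or not 0 <= epoch < boundaries[-1]:
        raise ValueError("epoch is outside the warmup schedule")
    # The first boundary strictly greater than `epoch` marks the active layer.
    return bisect_right(boundaries, epoch) + 1


# Toy schedule: later layers receive more epochs (values are made up).
schedule = [2, 3, 5]
active_layers = [layer_for_epoch(schedule, e) for e in range(sum(schedule))]
```

A training loop could call `layer_for_epoch` at the start of each warmup epoch to decide which teacher-student layer pair receives the distillation loss, and switch to $\mathcal{L}_{KD}$ plus $\mathcal{L}_{task}$ once the schedule is exhausted.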