
Harmonizing knowledge Transfer in Neural Network with Unified Distillation

Yaomin Huang, Zaomin Yan, Chaomin Shen, Faming Fang, Guixu Zhang

TL;DR

This paper aggregates features from intermediate layers into a comprehensive representation that gathers semantic information from different stages and scales, so that knowledge at different stages of the network can be distilled through a single, unified distribution constraint.

Abstract

Knowledge distillation (KD), known for its ability to transfer knowledge from a cumbersome network (teacher) to a lightweight one (student) without altering the architecture, has been garnering increasing attention. Two primary categories emerge within KD methods: feature-based, focusing on intermediate layers' features, and logits-based, targeting the final layer's logits. This paper introduces a novel perspective by leveraging diverse knowledge sources within a unified KD framework. Specifically, we aggregate features from intermediate layers into a comprehensive representation, effectively gathering semantic information from different stages and scales. Subsequently, we predict the distribution parameters from this representation. These steps transform knowledge from the intermediate layers into corresponding distributive forms, thereby allowing for knowledge distillation through a unified distribution constraint at different stages of the network, ensuring the comprehensiveness and coherence of knowledge transfer. Numerous experiments were conducted to validate the effectiveness of the proposed method.
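The two KD categories named in the abstract can be illustrated with a minimal, standard-library sketch. It is not the paper's implementation: the function names, the default temperature of 4.0, and the `temperature ** 2` scaling (a common convention for softened logits) are assumptions made for illustration only.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logits_kd_loss(teacher_logits, student_logits, temperature=4.0):
    """Logits-based KD: KL(teacher || student) on softened class probabilities.

    The temperature value and the T^2 factor are assumed conventions,
    not values taken from the paper.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


def feature_kd_loss(teacher_feat, student_feat):
    """Feature-based KD: mean squared error between same-length feature vectors."""
    squared = [(t - s) ** 2 for t, s in zip(teacher_feat, student_feat)]
    return sum(squared) / len(squared)


if __name__ == "__main__":
    print(logits_kd_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
    print(feature_kd_loss([1.0, 2.0], [1.0, 4.0]))
```

The two losses live in different spaces (a probability divergence versus a Euclidean distance), which is the mismatch a unified distribution constraint is meant to avoid.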

Paper Structure (27 sections, 7 equations, 5 figures, 7 tables)

This paper contains 27 sections, 7 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Comparison of feature-based knowledge distillation (MSE), logits-based knowledge distillation (KL), and the proposed UniKD. UniKD achieves unified knowledge distillation across different network layers, taking information at various levels into account while avoiding the incoherence that direct combinations introduce.
  • Figure 2: The overall framework of the proposed UniKD. The approach comprises two modules. First, features from the intermediate layers are aggregated by stacked Adaptive Features Fusion (AFF) modules; the Feature Distribution Prediction (FDP) module then derives the corresponding feature distribution. Unified Knowledge Distillation (UniKD) thus distills both the intermediate layers' feature distribution and the final layer's probability distribution under a unified constraint.
  • Figure 3: A comparative analysis of the convergence of test task loss $\mathcal{L}_{CE}$ and knowledge distillation loss $\mathcal{L}_{KD}$ across various types of distillation methods. All experiments were conducted on the CIFAR100 dataset utilizing a ResNet32$\times$4-ResNet8$\times$4 teacher-student pair architecture.
  • Figure 4: CDF curves of the differences between the logits output by the teacher and the student for various methods.
  • Figure 5: Difference of correlation matrices of student and teacher logits.
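
The pipeline described in the Figure 2 caption (aggregate intermediate features, predict distribution parameters, then constrain distributions) can be sketched as follows. The captions do not say how AFF or FDP are built or which distribution family is predicted, so everything here is a hypothetical stand-in: `aggregate` replaces the learned AFF modules with a fixed weighted sum, `predict_gaussian` replaces the learned FDP module with a simple split into mean and log-variance, and the diagonal Gaussian form is an assumption. Only the closed-form KL divergence between diagonal Gaussians is a standard result.

```python
import math


def aggregate(stage_features, weights):
    """Hypothetical stand-in for AFF: weighted sum of equal-length stage vectors."""
    dim = len(stage_features[0])
    return [
        sum(w * feat[i] for w, feat in zip(weights, stage_features))
        for i in range(dim)
    ]


def predict_gaussian(representation):
    """Hypothetical stand-in for FDP: first half is the mean, second half the log-variance."""
    half = len(representation) // 2
    return representation[:half], representation[half:]


def gaussian_kl(mu_p, logvar_p, mu_q, logvar_q):
    """Closed-form KL(N_p || N_q) for diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mp, lp, mq, lq in zip(mu_p, logvar_p, mu_q, logvar_q):
        total += 0.5 * (lq - lp + (math.exp(lp) + (mp - mq) ** 2) / math.exp(lq) - 1.0)
    return total


if __name__ == "__main__":
    # Toy per-stage features, assumed already pooled to a common length.
    teacher_stages = [[0.2, 0.4, 0.0, 0.1], [0.6, 0.0, 0.2, 0.3]]
    student_stages = [[0.1, 0.5, 0.1, 0.0], [0.4, 0.1, 0.3, 0.2]]
    weights = [0.5, 0.5]  # fixed here; a learned module would produce these

    t_mu, t_logvar = predict_gaussian(aggregate(teacher_stages, weights))
    s_mu, s_logvar = predict_gaussian(aggregate(student_stages, weights))
    print(gaussian_kl(t_mu, t_logvar, s_mu, s_logvar))
```

Under these assumptions the intermediate-layer term becomes a KL divergence, the same kind of quantity as the KL term on the final layer's class probabilities, which is one way to read "unified constraint".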