Multi-Level Decoupled Relational Distillation for Heterogeneous Architectures

Yaoxin Yang, Peng Ye, Weihao Lin, Kangcong Li, Yan Wen, Jia Hao, Tao Chen

TL;DR

Cross-architecture knowledge transfer often loses dark knowledge while trying to preserve correct-class confidence. MLDR-KD introduces Decoupled Finegrained Relation Alignment (DFRA), which splits logits into class-wise and sample-wise relations and aligns each with a KL-based loss, and Multi-Scale Dynamic Fusion (MSDF), which fuses logits derived from multiscale features, all within a unified cross-architecture framework. The approach shows gains across CNNs, Transformers, MLPs, and Mambas, with improvements of up to 4.86 percentage points on CIFAR-100 and 2.78 percentage points on Tiny-ImageNet, indicating robustness and generality in heterogeneous distillation. By preserving dark knowledge while maintaining confidence in the correct class, MLDR-KD offers a practical path to deploying cross-architecture distillation, supported by extensive experiments and ablations.

Abstract

Heterogeneous distillation is an effective way to transfer knowledge from cross-architecture teacher models to student models. However, existing heterogeneous distillation methods do not take full advantage of the dark knowledge hidden in the teacher's output, which limits their performance. To this end, we propose a novel framework named Multi-Level Decoupled Relational Knowledge Distillation (MLDR-KD) to unleash the potential of relational distillation in heterogeneous distillation. Concretely, we first introduce Decoupled Finegrained Relation Alignment (DFRA) at both the logit and feature levels to balance the trade-off between the distilled dark knowledge and the confidence in the correct category of the heterogeneous teacher model. Then, a Multi-Scale Dynamic Fusion (MSDF) module is applied to dynamically fuse the projected logits of multiscale features from different stages of the student model, further improving the performance of our method at the feature level. We verify our method on four architectures (CNNs, Transformers, MLPs, and Mambas) and two datasets (CIFAR-100 and Tiny-ImageNet). Compared with the best available method, MLDR-KD improves student model performance by up to 4.86% on CIFAR-100 and 2.78% on Tiny-ImageNet, showing robustness and generality in heterogeneous distillation. Code will be released soon.
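
The abstract says MSDF dynamically fuses the logits projected from multiscale student features, but it does not give the fusion rule. The following is a minimal sketch of one plausible reading, assuming each stage has already been projected to a logit vector and that fusion is a softmax-weighted sum. The names `fuse_stage_logits` and `gate_scores` are hypothetical, and how the gate scores are produced (for example by a small learned layer) is left open here.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_stage_logits(stage_logits, gate_scores):
    """Fuse per-stage logits for one sample with softmax weights.

    stage_logits: list of S logit vectors, one per student stage,
                  each of length C (assumed already projected).
    gate_scores:  list of S raw scores (hypothetical; in a real model
                  these would depend on the input, hence "dynamic").
    Returns the fused logit vector and the fusion weights.
    """
    weights = softmax(gate_scores)
    num_classes = len(stage_logits[0])
    fused = [
        sum(w * logits[c] for w, logits in zip(weights, stage_logits))
        for c in range(num_classes)
    ]
    return fused, weights
```

With equal gate scores the fusion reduces to a plain average of the stage logits. A strongly dominant score makes the fused logits approach those of a single stage.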

Paper Structure

This paper contains 13 sections, 15 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Conceptual comparisons of different knowledge distillation methods. Our Decoupled Relational KD first decouples the logits of teacher and student into multiple finegrained relationships between different classes under each sample and different samples under each class, and then aligns the relationships. In our method, Decoupled Relational KD is applied to both logit and multiscale feature levels (namely MLDR-KD).
  • Figure 2: Comparison of feature visualizations under different knowledge distillation methods. The teacher is Vision Mamba Tiny [vim], and the student is ResNet-18 [he2016deep]. Directly applying conventional relational KD underperforms in heterogeneous distillation, while our MLDR-KD substantially alleviates this problem.
  • Figure 3: Overview of the proposed MLDR-KD framework. It comprises two main components: Decoupled Finegrained Relation Alignment (DFRA) and Multi-Scale Dynamic Fusion (MSDF). In DFRA, after obtaining the logits of the teacher and the student, we decouple them into a class-wise relation and a sample-wise relation, and then align these relations via Kullback-Leibler divergence. DFRA is applied at both the logit and feature levels. MSDF further improves the effect of feature-level DFRA by dynamically fusing the student's feature maps.
  • Figure 4: Comparison of the prediction distribution, averaged over all samples of a single category, for OFA-KD ((a),(d),(g)), RKD ((b),(e),(h)), and our MLDR-KD ((c),(f),(i)). The three black boxes mark three randomly selected categories. In each panel, the left plot shows the logits of the categories other than the correct one, and the right plot shows the logit of the correct category. The figure shows that our method keeps high confidence in the correct category while transferring abundant dark knowledge from the teacher's logits.
  • Figure 5: Comparisons of feature visualizations between OFA-KD and our MLDR-KD. The teacher is Vision Mamba Tiny, the student is ResNet-18. Clearly, our approach makes the student model more focused on the target across various samples.
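
The Figure 1 and Figure 3 captions describe DFRA as decoupling the teacher and student logits into relations between classes under each sample and relations between samples under each class, then aligning them with Kullback-Leibler divergence. The sketch below illustrates that idea for a batch of logits stored as a list of rows (one row per sample). It is an assumption-laden illustration, not the authors' implementation: the row-wise and column-wise softmax, the temperature, the weights `alpha` and `beta`, and all function names are hypothetical.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax with a temperature (assumed)."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def transpose(matrix):
    """Swap the sample and class axes of a list-of-rows matrix."""
    return [list(column) for column in zip(*matrix)]


def mean_row_kl(teacher_rows, student_rows, temperature=1.0):
    """Mean KL(teacher || student) after a softmax over each row."""
    total = 0.0
    for t_row, s_row in zip(teacher_rows, student_rows):
        total += kl_divergence(
            softmax(t_row, temperature), softmax(s_row, temperature)
        )
    return total / len(teacher_rows)


def decoupled_relation_loss(teacher_logits, student_logits,
                            alpha=1.0, beta=1.0, temperature=1.0):
    """Weighted sum of a class-wise and a sample-wise KL term.

    Class-wise: softmax across classes for each sample (rows).
    Sample-wise: softmax across samples for each class (columns).
    alpha and beta are hypothetical balancing weights.
    """
    class_wise = mean_row_kl(teacher_logits, student_logits, temperature)
    sample_wise = mean_row_kl(
        transpose(teacher_logits), transpose(student_logits), temperature
    )
    return alpha * class_wise + beta * sample_wise
```

The loss is zero when the student reproduces the teacher's logits and positive otherwise. Under the same assumptions, the identical function could be applied to logits projected from intermediate features to obtain the feature-level term.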