Multi-Level Decoupled Relational Distillation for Heterogeneous Architectures
Yaoxin Yang, Peng Ye, Weihao Lin, Kangcong Li, Yan Wen, Jia Hao, Tao Chen
TL;DR
Cross-architecture knowledge transfer often loses dark knowledge while trying to preserve correct-class confidence. MLDR-KD introduces Decoupled Finegrained Relation Alignment (DFRA), which splits logits into class-wise and sample-wise relations and aligns them with KL-based losses. It also introduces Multi-Scale Dynamic Fusion (MSDF), which fuses logits derived from multiscale features. Both operate within a unified cross-architecture framework. The approach demonstrates gains across CNNs, Transformers, MLPs, and Mambas, with improvements of up to $4.86$ percentage points on CIFAR-100 and $2.78$ percentage points on Tiny-ImageNet, indicating robustness and generality in heterogeneous distillation. By preserving dark knowledge while maintaining confidence in the correct class, MLDR-KD offers a practical path to deploying cross-architecture distillation in real-world settings, supported by extensive experiments and ablations.
Abstract
Heterogeneous distillation is an effective way to transfer knowledge from cross-architecture teacher models to student models. However, existing heterogeneous distillation methods do not take full advantage of the dark knowledge hidden in the teacher's output, which limits their performance. To this end, we propose a novel framework named Multi-Level Decoupled Relational Knowledge Distillation (MLDR-KD) to unleash the potential of relational distillation in heterogeneous distillation. Concretely, we first introduce Decoupled Finegrained Relation Alignment (DFRA) at both the logit and feature levels to balance the trade-off between the distilled dark knowledge and the confidence in the correct category of the heterogeneous teacher model. Then, a Multi-Scale Dynamic Fusion (MSDF) module is applied to dynamically fuse the projected logits of multiscale features from different stages of the student model, further improving the performance of our method at the feature level. We verify our method on four architectures (CNNs, Transformers, MLPs, and Mambas) and two datasets (CIFAR-100 and Tiny-ImageNet). Compared with the best available method, MLDR-KD improves student model performance by up to 4.86% on CIFAR-100 and 2.78% on Tiny-ImageNet, showing robustness and generality in heterogeneous distillation. Code will be released soon.
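To make the idea of decoupled relation alignment concrete, the following is a minimal NumPy sketch under stated assumptions, not the authors' implementation. It assumes that the "class-wise relation" is a softmax over classes for each sample, that the "sample-wise relation" is a softmax over the samples in a batch for each class, and that each is aligned between teacher and student with a KL divergence. The function names, the temperature parameter, and the way the two terms are returned separately are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(logits, axis, temperature=1.0):
    """Numerically stable softmax along the given axis."""
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=axis, keepdims=True)
    exp = np.exp(scaled)
    return exp / exp.sum(axis=axis, keepdims=True)


def kl_divergence(p, q, axis, eps=1e-12):
    """KL(p || q) summed along the given axis."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=axis)


def decoupled_relation_loss(teacher_logits, student_logits, temperature=1.0):
    """Return (class_wise_loss, sample_wise_loss) for logits of shape (batch, classes).

    Assumption: this decoupling is one plausible reading of the abstract,
    not the formulation used in the paper.
    """
    # Class-wise relation: for each sample, how probability is spread over classes.
    t_cls = softmax(teacher_logits, axis=1, temperature=temperature)
    s_cls = softmax(student_logits, axis=1, temperature=temperature)
    class_wise = kl_divergence(t_cls, s_cls, axis=1).mean()

    # Sample-wise relation: for each class, how scores are spread over the batch.
    t_smp = softmax(teacher_logits, axis=0, temperature=temperature)
    s_smp = softmax(student_logits, axis=0, temperature=temperature)
    sample_wise = kl_divergence(t_smp, s_smp, axis=0).mean()

    return class_wise, sample_wise


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(8, 5))  # 8 samples, 5 classes (toy sizes)
    student = rng.normal(size=(8, 5))
    cls_loss, smp_loss = decoupled_relation_loss(teacher, student, temperature=2.0)
    print("class-wise KL:", cls_loss, "sample-wise KL:", smp_loss)
```

In a training loop, the two terms would typically be weighted and added to the task loss. The weights, and whether the same loss is also applied to logits projected from intermediate features, are left open here because the abstract does not specify them.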
