Heterogeneous Complementary Distillation

Liuchi Xu, Hao Zheng, Lu Wang, Lisheng Xu, Jun Cheng

TL;DR

The paper tackles cross-architecture knowledge distillation by addressing mismatches in spatial feature representations between teachers (e.g., ViTs) and students (e.g., CNNs). It introduces Heterogeneous Complementary Distillation (HCD), which uses a Complementary Feature Mapper (CFM) to map teacher and student features into a shared logits space, then decomposes these logits into multiple sub-logits via Sub-logit Decoupled Distillation (SDD) and enforces diversity with an Orthogonality Loss (OL). The total objective blends cross-entropy, standard KD KL loss, sub-logit KL/CE losses, and the orthogonality term, enabling effective transfer of complementary knowledge while preserving student strengths. Empirical results on CIFAR-100, ImageNet-1K, and fine-grained datasets show that HCD consistently outperforms state-of-the-art KD methods in both heterogeneous and homogeneous settings, with notable gains on challenging benchmarks. The work offers a practical, scalable approach to heterogeneous KD that improves robustness and generalization across diverse architectures.
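
To make the blended objective concrete, here is a minimal NumPy sketch of how the four terms might be combined. The weights `alpha`, `beta`, `gamma`, the temperature `T`, and the exact form of the sub-logit term (each sub-logit used as a soft target for the student and also supervised by the labels) are assumptions chosen for illustration; the summary above does not specify them, so this is not the authors' formulation.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled, numerically stable softmax over the last axis.
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    # Mean negative log-likelihood of the integer class labels.
    p = softmax(logits)
    return float(-np.log(p[np.arange(len(labels)), labels] + 1e-12).mean())

def kd_kl(student_logits, target_logits, T=4.0):
    # Standard KD term: KL(target || student) on softened distributions, scaled by T^2.
    p_t = softmax(target_logits, T)
    p_s = softmax(student_logits, T)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)
    return float(kl.mean() * T * T)

def total_loss(student_logits, teacher_logits, sub_logits, labels, ortho,
               alpha=1.0, beta=1.0, gamma=0.1, T=4.0):
    # alpha, beta, gamma are hypothetical weights, not values from the paper.
    ce = cross_entropy(student_logits, labels)
    kd = kd_kl(student_logits, teacher_logits, T)
    # Assumed sub-logit term: KL toward each sub-logit plus CE on each sub-logit.
    sub = float(np.mean([kd_kl(student_logits, s, T) + cross_entropy(s, labels)
                         for s in sub_logits]))
    return ce + alpha * kd + beta * sub + gamma * ortho
```

Here `ortho` is the already-computed orthogonality penalty, passed in as a scalar so that the weighting of the four terms stays visible in one place.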

Abstract

Knowledge distillation (KD) transfers the dark knowledge from a complex teacher to a compact student. However, heterogeneous architecture distillation, such as Vision Transformer (ViT) to ResNet18, faces challenges due to differences in spatial feature representations. Traditional KD methods are mostly designed for homogeneous architectures and hence struggle to effectively address the disparity. Although heterogeneous KD approaches have been developed recently to solve these issues, they often incur high computational costs and complex designs, or overly rely on logit alignment, which limits their ability to leverage the complementary features. To overcome these limitations, we propose Heterogeneous Complementary Distillation (HCD), a simple yet effective framework that integrates complementary teacher and student features to align representations in shared logits. These logits are decomposed and constrained to facilitate diverse knowledge transfer to the student. Specifically, HCD processes the student's intermediate features through a convolutional projector and adaptive pooling, concatenates them with the teacher's feature from the penultimate layer, and then maps them via the Complementary Feature Mapper (CFM) module, comprising a fully connected layer, to produce shared logits. We further introduce Sub-logit Decoupled Distillation (SDD), which partitions the shared logits into n sub-logits that are fused with the teacher's logits to rectify classification. To ensure sub-logit diversity and reduce redundant knowledge transfer, we propose an Orthogonality Loss (OL). By preserving student-specific strengths and leveraging teacher knowledge, HCD enhances robustness and generalization in students. Extensive experiments on the CIFAR-100, fine-grained (e.g., CUB200), and ImageNet-1K datasets demonstrate that HCD outperforms state-of-the-art KD methods, establishing it as an effective solution for heterogeneous KD.
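
The data flow described in the abstract can be sketched in a few lines of NumPy. This is a rough illustration under stated assumptions, not the authors' implementation: the projector is assumed to be a 1x1 convolution, adaptive pooling is assumed to reduce to a 1x1 global average, the shared logits are assumed to have width n × K (K classes) so they split evenly into n sub-logits, fusion with the teacher's logits is assumed to be element-wise addition, and the orthogonality penalty is assumed to be the mean squared off-diagonal cosine similarity between sub-logits. All function and parameter names (`conv1x1`, `hcd_forward`, `P`, `W`, `b`) are hypothetical.

```python
import numpy as np

def conv1x1(feat, P):
    # A 1x1 convolution is a per-pixel linear map over channels:
    # (B, C, H, W) with P of shape (C, D) -> (B, D, H, W).
    return np.einsum("bchw,cd->bdhw", feat, P)

def hcd_forward(student_feat, teacher_feat, teacher_logits, P, W, b, n):
    # Student branch: projector, then global average pooling to (B, D).
    s = conv1x1(student_feat, P).mean(axis=(2, 3))
    # Concatenate with the teacher's penultimate-layer feature (B, Dt).
    joint = np.concatenate([s, teacher_feat], axis=1)
    # CFM as a single fully connected layer producing shared logits (B, n*K).
    shared = joint @ W + b
    # Partition into n sub-logits of K classes each: (B, n, K).
    sub = shared.reshape(shared.shape[0], n, -1)
    # Assumed fusion: add the teacher's logits to every sub-logit.
    fused = sub + teacher_logits[:, None, :]
    return sub, fused

def orthogonality_loss(sub):
    # sub: (B, n, K). Penalize squared cosine similarity between distinct sub-logits.
    n = sub.shape[1]
    unit = sub / (np.linalg.norm(sub, axis=-1, keepdims=True) + 1e-12)
    gram = unit @ unit.transpose(0, 2, 1)          # (B, n, n)
    off_diag = gram * (1.0 - np.eye(n))            # zero out self-similarity
    return float((off_diag ** 2).sum(axis=(1, 2)).mean() / max(n * (n - 1), 1))

# Toy shapes, chosen arbitrarily for illustration.
rng = np.random.default_rng(0)
B, Cs, H, Wd, Dp, Dt, K, n = 2, 8, 4, 4, 12, 16, 10, 3
sub, fused = hcd_forward(
    rng.normal(size=(B, Cs, H, Wd)),     # student intermediate feature map
    rng.normal(size=(B, Dt)),            # teacher penultimate feature
    rng.normal(size=(B, K)),             # teacher logits
    rng.normal(size=(Cs, Dp)),           # projector weights
    rng.normal(size=(Dp + Dt, n * K)),   # CFM weights
    np.zeros(n * K),                     # CFM bias
    n,
)
penalty = orthogonality_loss(sub)
```

Under this choice the penalty is 0 when the sub-logits are mutually orthogonal and approaches 1 when they all point in the same direction, which matches the stated goal of keeping sub-logits diverse.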

Paper Structure

This paper contains 20 sections, 17 equations, 1 figure, 16 tables, 1 algorithm.

Figures (1)

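  • Figure 1: Overview of the proposed HCD framework, which includes three components: the Complementary Feature Mapper (CFM), Sub-logit Decoupled Distillation (SDD), and the Orthogonality Loss (OL).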