UHKD: A Unified Framework for Heterogeneous Knowledge Distillation via Frequency-Domain Representations

Fengming Yu, Haiwei Pan, Kejia Zhang, Jian Guan, Haiying Jiang

TL;DR

UHKD tackles heterogeneous knowledge distillation by transferring intermediate knowledge through frequency-domain representations, bridging semantic gaps across diverse architectures. It introduces a dual-module pipeline: FTM converts teacher features into compact, refined frequency-domain representations, while FAM learns to align student features into the same spectral space. The framework is trained with a joint objective that fuses frequency-domain MSE with logits-based KL and standard cross-entropy losses, yielding consistent gains over state-of-the-art heterogeneous KD methods on CIFAR-100 and ImageNet-1K, and maintaining robustness in homogeneous settings. Empirical results, ablations, and visual analyses confirm that frequency-domain representations effectively capture global semantics and mitigate architectural discrepancies, enabling scalable and efficient cross-architecture knowledge transfer.

Abstract

Knowledge distillation (KD) is an effective model compression technique that transfers knowledge from a high-performance teacher to a lightweight student, reducing computational and storage costs while maintaining competitive accuracy. However, most existing KD methods are tailored for homogeneous models and perform poorly in heterogeneous settings, particularly when intermediate features are involved. Semantic discrepancies across architectures hinder effective use of the teacher's intermediate representations, while prior heterogeneous KD studies focus mainly on the logits space and underuse the rich semantic information in intermediate layers. To address this, we propose Unified Heterogeneous Knowledge Distillation (UHKD), a framework that uses intermediate features in the frequency domain for cross-architecture transfer. Frequency-domain representations capture global semantic knowledge and mitigate representational discrepancies between heterogeneous teacher-student pairs. Specifically, a Feature Transformation Module (FTM) generates compact frequency-domain representations of teacher features, while a learnable Feature Alignment Module (FAM) projects student features and aligns them with the teacher's via multi-level matching. Training is guided by a joint objective combining mean squared error on intermediate features with Kullback-Leibler divergence on logits. Extensive experiments on CIFAR-100 and ImageNet-1K demonstrate the effectiveness of the proposed approach, with maximum gains of 5.59% and 0.83% over the latest heterogeneous distillation method on the two datasets, respectively. Code will be released soon.
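
The joint objective described above (frequency-domain MSE on intermediate features, KL divergence on logits, and the cross-entropy term mentioned in the TL;DR) can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the learned FTM and FAM modules are omitted, teacher and student features are assumed to already share the shape (N, C, H, W), features are compared through 2D FFT magnitude spectra, and the function names, the weights `alpha` and `beta`, and the temperature `T` are all hypothetical.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def freq_mse(student_feat, teacher_feat):
    # Assumption: compare magnitude spectra of a 2D FFT over the spatial
    # axes. The paper's FTM/FAM refine and project features before this
    # step; both modules are left out here.
    fs = np.abs(np.fft.fft2(student_feat, axes=(-2, -1)))
    ft = np.abs(np.fft.fft2(teacher_feat, axes=(-2, -1)))
    return float(np.mean((fs - ft) ** 2))

def kd_kl(student_logits, teacher_logits, T=4.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 as is common in logit distillation.
    log_p_t = log_softmax(teacher_logits / T)
    log_p_s = log_softmax(student_logits / T)
    kl = np.sum(np.exp(log_p_t) * (log_p_t - log_p_s), axis=-1)
    return float(T ** 2 * np.mean(kl))

def cross_entropy(logits, labels):
    # Mean negative log-likelihood of the integer class labels.
    log_p = log_softmax(logits)
    return float(-np.mean(log_p[np.arange(len(labels)), labels]))

def joint_loss(student_feat, teacher_feat, student_logits, teacher_logits,
               labels, alpha=1.0, beta=1.0, T=4.0):
    # Hypothetical weighting: CE + alpha * KL + beta * frequency MSE.
    return (cross_entropy(student_logits, labels)
            + alpha * kd_kl(student_logits, teacher_logits, T)
            + beta * freq_mse(student_feat, teacher_feat))
```

When the student reproduces the teacher's features and logits exactly, both distillation terms vanish and `joint_loss` reduces to the plain cross-entropy, which is a quick sanity check for any implementation of this kind of objective.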

Paper Structure

This paper contains 32 sections, 11 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Intermediate feature visualization of different architectures. Left: original image. Top right: ViT-S (teacher) intermediate feature. Bottom right: ResNet-18 (student) intermediate feature. Middle: difference between teacher and student features.
  • Figure 2: Overview of unified heterogeneous knowledge distillation. (a) The UHKD framework aligns teacher and student intermediate features in the frequency domain for effective knowledge transfer; (b) the FTM efficiently captures global representations of the teacher model through Spectral Representation Refinement (SRR) and Sequence Representation Transformation (SRT); (c) the FAM adapts student features through Spectral Channel Alignment (SCA) and Sequence Representation Alignment (SRA) to match the frequency-domain features of the teacher model.
  • Figure 3: Visualization of intermediate features before and after UHKD. (a) Before UHKD; (b) After UHKD. In each case, the left column shows the original image, the top and bottom rows show feature maps from different stages of the Swin-T teacher and ResNet-18 student, and the middle row shows their difference maps.
  • Figure 4: Comparison of feature similarities before and after UHKD between Swin-T and ResNet-18. Red bars denote cosine similarity, and blue bars denote Pearson correlation.
  • Figure 5: Comparison of feature similarities before and after UHKD between Swin-T and ResMLP-S12. Red bars denote cosine similarity, and blue bars denote Pearson correlation.
  • ...and 5 more figures