Perspective-Aware Teaching: Adapting Knowledge for Heterogeneous Distillation

Jhe-Hao Lin, Yi Yao, Chan-Feng Hsu, Hongxia Xie, Hong-Han Shuai, Wen-Huang Cheng

TL;DR

This work tackles cross-architecture knowledge distillation by addressing two main bottlenecks: view mismatch between heterogeneous architectures and teacher unawareness of the student’s learning progress. It introduces Perspective-Aware Teaching (PAT), a feature-based distillation framework that combines Region-Aware Attention (RAA) to align perspectives and Adaptive Feedback Prompts (AFP) to adapt teacher features via student feedback, all within a unified loss $L_{PAT} = L_{CE} + \alpha L_{KL} + \beta L_{FD} + \gamma L_{Reg}$. The method preserves spatial information (via $L_{FD}$ with Hierarchical Context Loss) and maintains teacher discriminativeness (through $L_{Reg}$), enabling effective distillation across CNNs, ViTs, and MLPs. Empirical results on CIFAR-100, ImageNet, and COCO demonstrate state-of-the-art improvements over prior KD approaches, with notable gains on classification and detection tasks, highlighting practical applicability to diverse downstream workloads.
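As a rough illustration of how the four terms in $L_{PAT}$ combine, the sketch below computes $L_{CE}$ and $L_{KL}$ from raw logits in plain Python and then forms the weighted sum. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the default weights, and the choice to pass $L_{FD}$ and $L_{Reg}$ in as precomputed scalars are all hypothetical, since this summary does not define how those two terms are computed.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """L_CE: negative log-probability the student assigns to the true class."""
    return -math.log(softmax(student_logits)[label])


def kl_divergence(teacher_logits, student_logits):
    """L_KL: KL(teacher || student) between the two softmax distributions."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def pat_total_loss(l_ce, l_kl, l_fd, l_reg, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum L_CE + alpha*L_KL + beta*L_FD + gamma*L_Reg.

    l_fd and l_reg are assumed to be precomputed scalars; the default
    weights are placeholders, not values taken from the paper.
    """
    return l_ce + alpha * l_kl + beta * l_fd + gamma * l_reg


# Hypothetical example values for one training sample.
student = [0.2, 1.5, -0.3]
teacher = [0.1, 2.0, -1.0]
loss = pat_total_loss(
    cross_entropy(student, label=1),
    kl_divergence(teacher, student),
    l_fd=0.4,   # placeholder feature-distillation loss
    l_reg=0.05, # placeholder regularization loss
    alpha=0.5, beta=0.1, gamma=0.01,
)
```

Here $\alpha$, $\beta$, and $\gamma$ simply trade off logit matching, feature imitation, and teacher regularization against the supervised cross-entropy term.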

Abstract

Knowledge distillation (KD) involves transferring knowledge from a pre-trained heavy teacher model to a lighter student model, thereby reducing the inference cost while maintaining comparable effectiveness. Prior KD techniques typically assume homogeneity between the teacher and student models. However, as technology advances, a wide variety of architectures have emerged, ranging from early Convolutional Neural Networks (CNNs) to Vision Transformers (ViTs) and Multi-Layer Perceptrons (MLPs). Consequently, developing a universal KD framework compatible with any architecture has become an important research topic. In this paper, we introduce a perspective-aware teaching (PAT) KD framework to enable feature distillation across diverse architectures. Our framework comprises two key components. First, we design prompt tuning blocks that incorporate student feedback, allowing teacher features to adapt to the student model's learning process. Second, we propose region-aware attention to mitigate the view mismatch problem between heterogeneous architectures. By leveraging these two modules, effective distillation of intermediate features can be achieved across heterogeneous architectures. Extensive experiments on CIFAR, ImageNet, and COCO demonstrate the superiority of the proposed method. Our code is available at https://github.com/jimmylin0979/PAT.git.

Paper Structure

This paper contains 22 sections, 8 equations, 3 figures, and 8 tables.

Figures (3)

  • Figure 1: High-level comparison between the former state-of-the-art method, OFA, and our method, PAT. Due to the view mismatch (red lines) and teacher unawareness (brown dashed lines) problems, OFA can only use final logits to supervise student intermediate features (black solid lines), thereby restricting its utility in downstream tasks. In contrast, PAT enables feature imitation by solving these two issues with the RAA and AFP modules, respectively.
  • Figure 2: Overview of the general structure of our PAT framework and the proposed modules. (a) RAA: student features across all stages are concatenated and fed into an attention module that learns to integrate them into a new feature with a perspective similar to that of the teacher model. (b) AFP: a prompt-tuning method is introduced to modify the stage output features of the teacher model with respect to the student model's learning process. Note: only three stages are shown for convenience. In our experiments, all models are split into four stages.
  • Figure 3: Visualizations of the attention maps within RAA for different model pairs. Queries are sorted by patch and stage position, as the black arrow shows.