CanKD: Cross-Attention-based Non-local operation for Feature-based Knowledge Distillation

Shizhe Sun, Wataru Ohyama

TL;DR

CanKD targets efficient knowledge transfer for dense prediction by modeling cross-pixel relations between student and teacher through a Cross-Attention Non-local (Can) block. The method transforms the student feature map with the Can block and trains the student with $\mathcal{L}_{student} = \mathcal{L}_{task} + \mu \mathcal{L}_{feat}$, so the only addition to training is one feature loss term weighted by $\mu$. Experiments on COCO object detection, Cityscapes segmentation, and vision foundation-model benchmarks show that CanKD consistently outperforms state-of-the-art feature and hybrid distillation methods, sometimes surpassing stronger teachers. The approach is modular and lightweight, which makes it broadly applicable to FPN-based architectures.

Abstract

We propose Cross-Attention-based Non-local Knowledge Distillation (CanKD), a novel feature-based knowledge distillation framework that leverages cross-attention mechanisms to enhance the knowledge transfer process. Unlike traditional self-attention-based distillation methods that align teacher and student feature maps independently, CanKD enables each pixel in the student feature map to dynamically consider all pixels in the teacher feature map. This non-local knowledge transfer more thoroughly captures pixel-wise relationships, improving feature representation learning. Our method introduces only an additional loss function to achieve superior performance compared with existing attention-guided distillation methods. Extensive experiments on object detection and image segmentation tasks demonstrate that CanKD outperforms state-of-the-art feature and hybrid distillation methods. These experimental results highlight CanKD's potential as a new paradigm for attention-guided distillation in computer vision tasks. Code is available at https://github.com/tori-hotaru/CanKD
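
The idea that each student pixel attends to all teacher pixels, combined with the loss from the TL;DR, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the identity projections, the scaled dot-product softmax affinity, the residual connection, and the use of mean squared error for $\mathcal{L}_{feat}$ are all assumptions made for illustration. The released code at the URL above is the reference.

```python
import numpy as np

def cross_attention_nonlocal(student, teacher):
    """Hypothetical Can-style block (illustrative, not the paper's exact block).

    student, teacher: feature maps of identical shape (C, H, W).
    Assumptions: no learned projections, scaled dot-product softmax
    affinity, and a residual connection back onto the student map.
    """
    c, h, w = student.shape
    q = student.reshape(c, h * w).T   # (N, C): one query per student pixel
    k = teacher.reshape(c, h * w).T   # (N, C): one key per teacher pixel
    v = k                             # values are also taken from the teacher

    scores = q @ k.T / np.sqrt(c)     # (N, N): student-to-teacher affinities
    scores -= scores.max(axis=1, keepdims=True)  # stabilize the softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # each row sums to 1

    out = attn @ v                    # (N, C): teacher content per student pixel
    return student + out.T.reshape(c, h, w), attn

def student_loss(task_loss, student, teacher, mu=1.0):
    """L_student = L_task + mu * L_feat, with L_feat assumed to be an MSE."""
    transformed, _ = cross_attention_nonlocal(student, teacher)
    feat_loss = float(np.mean((transformed - teacher) ** 2))
    return task_loss + mu * feat_loss

# Usage with random stand-in features (C=4, H=3, W=5).
rng = np.random.default_rng(0)
s = rng.normal(size=(4, 3, 5))
t = rng.normal(size=(4, 3, 5))
transformed, attn = cross_attention_nonlocal(s, t)
total = student_loss(task_loss=2.0, student=s, teacher=t, mu=0.5)
```

Each row of `attn` is a distribution over all teacher pixels for one student pixel, which is the non-local, cross-map relation the abstract describes. In a detector with an FPN, a sketch like this would be applied per pyramid level and the feature losses summed, again as an assumption rather than a statement about the released code.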

Paper Structure

This paper contains 22 sections, 8 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Proposed Cross-attention-based Non-local (Can) block.
  • Figure 2: Overview of the proposed feature-based knowledge distillation framework.
  • Figure 3: Feature-based knowledge distillation for an object detection task. The proposed method uses the feature pyramid network (FPN; Lin et al., 2017) for the bottleneck layers. We apply the distillation at each layer of the FPN.
  • Figure 5: Visualization heatmaps for different affinity functions. We use RepPoints-X101 (Yang et al., 2019) as the teacher and RepPoints-R50 as the student. These heatmaps are generated from P6 in the FPN layers.
  • Figure 6: Visualization of heatmaps on P6 in the FPN layers for the COCO validation dataset, showing the student, the student distilled with CanKD, and the teacher. These heatmaps are selected from T:FasterRCNN-R101-S:FasterRCNN-R50 and T:RepPoints-X101-S:RepPoints-R50.
  • ...and 1 more figure