FreeKD: Knowledge Distillation via Semantic Frequency Prompt

Yuan Zhang, Tao Huang, Jiaming Liu, Tao Jiang, Kuan Cheng, Shanghang Zhang

TL;DR

This work introduces FreeKD, a frequency-domain knowledge distillation framework for dense prediction that addresses information loss from teacher downsampling. It leverages semantic Frequency Prompts to localize high- and low-frequency pixels of interest (PoIs) and a position-aware relational loss to emphasize cross-layer, channel-wise importance, enabling precise frequency imitation. The approach yields consistent improvements over spatial-based KD across object detection, semantic segmentation, corruption robustness, and large-scale vision models, demonstrating strong generalization and robustness. Overall, FreeKD provides a principled, plug-in mechanism to distill frequency-domain information, offering practical benefits for efficient deployment of dense prediction systems.

Abstract

Knowledge distillation (KD) has been applied successfully to various tasks, and mainstream methods typically boost the student model via spatial imitation losses. However, the consecutive downsampling operations in the spatial domain of the teacher model are a type of corruption, hindering the student from analyzing what specific information needs to be imitated, which results in accuracy degradation. To better understand the underlying pattern of corrupted feature maps, we shift our attention to the frequency domain. During frequency distillation, we encounter a new challenge: the low-frequency bands convey general but minimal context, while the high-frequency bands are more informative but also introduce noise. Not every pixel within the frequency bands contributes equally to performance. To address these problems: (1) We propose the Frequency Prompt, plugged into the teacher model, which absorbs the semantic frequency context during finetuning. (2) During the distillation period, a pixel-wise frequency mask is generated via the Frequency Prompt to localize the pixels of interest (PoIs) in various frequency bands. Additionally, we employ a position-aware relational frequency loss for dense prediction tasks, delivering a high-order spatial enhancement to the student model. We dub our Frequency Knowledge Distillation method FreeKD, which determines the optimal localization and extent for the frequency distillation. Extensive experiments demonstrate that FreeKD not only consistently outperforms spatial-based distillation methods on dense prediction tasks (e.g., FreeKD brings 3.8 AP gains for RepPoints-R50 on COCO2017 and 4.55 mIoU gains for PSPNet-R18 on Cityscapes), but also conveys more robustness to the student. Notably, we also validate the generalization of our approach on large-scale vision models (e.g., DINO and SAM).
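
The claim that spatial downsampling acts as a corruption can be made concrete with a small frequency decomposition. The sketch below is illustrative only and is not taken from the paper: it assumes a single-level 2D Haar transform (the abstract does not name the transform), and the names `haar2d`, `avg_pool2x2`, `energy`, and the 4x4 `feature` map are hypothetical. It shows that 2x2 average pooling keeps only a scaled copy of the low-frequency band, so everything stored in the high-frequency bands is discarded.

```python
# Illustrative sketch only: a single-level 2D Haar transform on plain lists.
# Assumption: Haar is chosen for simplicity; the abstract does not specify
# which frequency transform the method uses.

def haar2d(x):
    """Split a 2D map (even height and width) into one low-frequency band
    and three high-frequency bands."""
    bands = {"low": [], "col_diff": [], "row_diff": [], "diag_diff": []}
    for i in range(0, len(x), 2):
        rows = {name: [] for name in bands}
        for j in range(0, len(x[0]), 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            rows["low"].append((a + b + c + d) / 2)        # scaled local average
            rows["col_diff"].append((a - b + c - d) / 2)   # change across columns
            rows["row_diff"].append((a + b - c - d) / 2)   # change across rows
            rows["diag_diff"].append((a - b - c + d) / 2)  # diagonal change
        for name in bands:
            bands[name].append(rows[name])
    return bands


def avg_pool2x2(x):
    """2x2 average pooling with stride 2, a common downsampling step."""
    return [
        [(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4
         for j in range(0, len(x[0]), 2)]
        for i in range(0, len(x), 2)
    ]


def energy(m):
    """Sum of squared entries of a 2D map."""
    return sum(v * v for row in m for v in row)


# Hypothetical 4x4 feature map with one sharp local detail (the 9).
feature = [
    [1, 1, 5, 5],
    [1, 1, 5, 5],
    [1, 9, 5, 5],
    [1, 1, 5, 5],
]

bands = haar2d(feature)
pooled = avg_pool2x2(feature)
high_energy = sum(energy(bands[k]) for k in ("col_diff", "row_diff", "diag_diff"))

print("pooled map:", pooled)
print("energy kept in the low band:", energy(bands["low"]))
print("energy in the high bands (dropped by pooling):", high_energy)
```

Because this Haar transform is orthonormal, the band energies sum to the energy of the input, which makes it easy to see how much of the signal a pooling step throws away.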

Paper Structure (37 sections, 17 equations, 8 figures, 10 tables)

This paper contains 37 sections, 17 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Comparison of the presentation of the bear at different downsampling ratios in the spatial and frequency domains.
  • Figure 2: Comparison with other insertion methods for spatial prompts. (a) Prompts are inserted into the encoder layer as tokens. (b) Prompts are summed with the RGB channels of the input image. (c) Ours interact with intermediate features. Best viewed in color.
  • Figure 3: Overview of our FreeKD pipeline. The pipeline includes two stages. Stage 1: Frequency prompts interact with intermediate frequency bands and are supervised by the teacher task loss. Stage 2: First, the distillation feature maps of the student and teacher are each transformed into the frequency domain. Then, the frozen frequency prompts from Stage 1 are multiplied with the teacher frequency bands to generate the PoIs of the bands. Finally, a channel-wise position-aware weight is determined jointly by the teacher spatial gate and the student gate. Flow (1) in the figure decides where to distill, and flow (2) indicates the extent of the distillation.
  • Figure 4: Visualization of student features, features of the student distilled with FreeKD, and teacher features on the COCO dataset. The cases are randomly selected from the val set, and the heatmaps are generated with AblationCAM [ramaswamy2020ablation].
  • Figure 5: Visualization of high-frequency pixels of interest (PoIs) on the COCO dataset via RepPoints-X101.
  • ...and 3 more figures
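
The second stage described in the Figure 3 caption (a mask that decides where to distill, and a channel weight that sets the extent) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the caption gives no formulas, so the sigmoid used to turn the prompt-band product into a mask, the softmax over the product of gate scores, and all names (`poi_mask`, `channel_weights`, `masked_frequency_loss`) and toy values are hypothetical.

```python
import math

# Hypothetical sketch of the two flows in the Figure 3 caption.
# Assumptions (not from the source): sigmoid masking, softmax channel weights,
# and a mean squared error between teacher and student frequency bands.


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def poi_mask(prompt, teacher_band):
    """Flow (1), where to distill: multiply the frozen prompt with the
    teacher band pixel by pixel and squash the result into (0, 1)."""
    return [
        [sigmoid(p * t) for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(prompt, teacher_band)
    ]


def channel_weights(teacher_gate, student_gate):
    """Flow (2), extent of distillation: softmax over the product of the
    per-channel teacher and student gate scores."""
    scores = [t * s for t, s in zip(teacher_gate, student_gate)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift by max for stability
    total = sum(exps)
    return [e / total for e in exps]


def masked_frequency_loss(teacher_bands, student_bands, prompts, weights):
    """Channel-weighted, mask-weighted mean squared error between bands.
    Each argument is indexed by channel; bands and prompts are 2D lists."""
    loss = 0.0
    for t_band, s_band, prompt, w in zip(teacher_bands, student_bands, prompts, weights):
        mask = poi_mask(prompt, t_band)
        n_pixels = len(t_band) * len(t_band[0])
        err = sum(
            m * (t - s) ** 2
            for m_row, t_row, s_row in zip(mask, t_band, s_band)
            for m, t, s in zip(m_row, t_row, s_row)
        )
        loss += w * err / n_pixels
    return loss


# Toy two-channel example with 2x2 frequency bands (values are made up).
teacher = [[[1.0, -2.0], [0.5, 3.0]], [[0.0, 1.0], [2.0, -1.0]]]
student = [[[0.5, -1.0], [0.5, 2.0]], [[0.0, 0.0], [1.0, -1.0]]]
prompts = [[[1.0, 1.0], [1.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]]
weights = channel_weights([0.2, 0.9], [0.5, 0.4])

print("channel weights:", weights)
print("loss:", masked_frequency_loss(teacher, student, prompts, weights))
```

Keeping the mask and the channel weight as separate factors mirrors the caption's split between localization and extent, so either part can be replaced without touching the other.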