Generative Denoise Distillation: Simple Stochastic Noises Induce Efficient Knowledge Transfer for Dense Prediction

Zhaoge Liu, Xiaohao Xu, Yunkang Cao, Weiming Shen

TL;DR

The paper tackles the inefficiency of traditional teacher-student knowledge transfer in dense prediction tasks by introducing Generative Denoise Distillation (GDD), which injects stochastic noise into the student’s concept features and uses a generation module to produce denoised, instance-focused embeddings aligned with the teacher via channel-wise KL divergence. The method combines a simple task loss with a distillation loss, $\text{L} = \text{L}_{\text{task}} + \boldsymbol{b1}\,\text{L}_{\text{distill}}$, and demonstrates state-of-the-art or competitive results across semantic segmentation, instance segmentation, and object detection on Cityscapes and COCO. Key contributions include the stochastic noise generation, the Instantiation Denoise Network, and the Channel Knowledge Alignment mechanism, which together improve knowledge transfer by focusing on channel-wise instance information rather than exhaustive spatial mimicry. The results suggest GDD’s practical impact for deploying efficient, high-performing dense prediction models, with potential for broader application and future refinements in noise design and multi-teacher setups.
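
To make the channel-wise alignment and the combined objective concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes each feature map is given as a list of channels, each channel flattened to a list of H*W activations, and that every channel is turned into a distribution over spatial locations with a softmax before the KL divergence is taken. The names `channel_log_softmax`, `channel_kl`, `total_loss`, and the `temperature` and `weight` parameters are hypothetical.

```python
import math


def channel_log_softmax(channel, temperature=1.0):
    """Log-softmax over all spatial locations of one flattened channel."""
    scaled = [v / temperature for v in channel]
    peak = max(scaled)  # subtract the max for numerical stability
    log_total = math.log(sum(math.exp(v - peak) for v in scaled))
    return [v - peak - log_total for v in scaled]


def channel_kl(teacher, student, temperature=1.0):
    """Mean over channels of KL(teacher || student).

    teacher, student: lists of channels, each a flat list of activations.
    Both must have the same number of channels and locations (assumption).
    """
    loss = 0.0
    for t_chan, s_chan in zip(teacher, student):
        log_p = channel_log_softmax(t_chan, temperature)
        log_q = channel_log_softmax(s_chan, temperature)
        loss += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return loss / len(teacher)


def total_loss(task_loss, distill_loss, weight):
    """Task loss plus a weighted distillation term."""
    return task_loss + weight * distill_loss


# Toy features: 2 channels, 4 spatial locations each (illustrative values).
teacher = [[2.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
student = [[0.5, 0.5, 0.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
distill = channel_kl(teacher, student)
print(total_loss(task_loss=1.0, distill_loss=distill, weight=2.0))
```

Because the softmax normalizes each channel separately, the student is pushed to match where each teacher channel concentrates its response, not the raw activation at every location.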

Abstract

Knowledge distillation is the process of transferring knowledge from a more powerful large model (teacher) to a simpler counterpart (student). Numerous current approaches involve the student imitating the knowledge of the teacher directly. However, redundancy still exists in the learned representations through these prevalent methods, which tend to learn each spatial location's features indiscriminately. To derive a more compact representation (concept feature) from the teacher, inspired by human cognition, we suggest an innovative method, termed Generative Denoise Distillation (GDD), where stochastic noises are added to the concept feature of the student to embed them into the generated instance feature from a shallow network. Then, the generated instance feature is aligned with the knowledge of the instance from the teacher. We extensively experiment with object detection, instance segmentation, and semantic segmentation to demonstrate the versatility and effectiveness of our method. Notably, GDD achieves new state-of-the-art performance in the tasks mentioned above. We have achieved substantial improvements in semantic segmentation by enhancing PspNet and DeepLabV3, both of which are based on ResNet-18, resulting in mIoU scores of 74.67 and 77.69, respectively, surpassing their previous scores of 69.85 and 73.20 on the Cityscapes dataset of 20 categories. The source code is available at https://github.com/ZhgLiu/GDD.

Paper Structure (28 sections, 9 equations, 6 figures, 8 tables, 1 algorithm)

This paper contains 28 sections, 9 equations, 6 figures, 8 tables, 1 algorithm.

Figures (6)

  • Figure 1: Visualization of different knowledge distillation methods. (a) Conventional spatial knowledge distillation. (b) Channel-wise knowledge distillation. (c) Mask generative knowledge distillation. (d) Stochastic generative knowledge distillation.
  • Figure 2: Visualization of the backbone layer feature map. Student: DeepLabV3-Res18, Teacher: PspNet-Res101. MGD is one of the most effective existing distillation methods; GDD is the method proposed in this paper.
  • Figure 3: Pipeline overview of our method, Generative Denoise Distillation (GDD). For the student model, we add stochastic Gaussian noise to its feature maps as a perturbation, use a generation module to obtain a new feature embedding, and finally perform distillation for different instance objects along the channel dimension.
  • Figure 4: Semantic segmentation: qualitative comparative results. (a) Image: raw image, (b) GT: ground truth, (c) Student: DeepLabV3-Res18, (d) MGD: Mask Generative Distillation (Yang et al., 2022), (e) GDD: Generative Denoise Distillation, (f) Teacher: PspNet-Res101.
  • Figure 5: Object detection: qualitative comparative results. GT: ground truth, MGD: Mask Generative Distillation (Yang et al., 2022), GDD: Generative Denoise Distillation.
  • ...and 1 more figure
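
The first two student-side steps named in the Figure 3 caption (Gaussian perturbation, then a generation module) can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's network: feature maps are lists of flattened channels, and the generation module is replaced by a hypothetical per-location linear map across channels followed by a ReLU. The function names, `sigma`, `weights`, and `biases` are all invented for the example.

```python
import random


def add_gaussian_noise(feature, sigma, rng):
    """Perturb every activation with zero-mean Gaussian noise of std sigma."""
    return [[v + rng.gauss(0.0, sigma) for v in channel] for channel in feature]


def pointwise_generate(feature, weights, biases):
    """Hypothetical stand-in for the generation module.

    Applies a per-location linear map across channels (like a 1x1
    convolution) followed by a ReLU. weights[o][c] maps input channel c
    to output channel o.
    """
    n_loc = len(feature[0])
    out = []
    for w_row, b in zip(weights, biases):
        out.append([
            max(0.0, b + sum(w * feature[c][i] for c, w in enumerate(w_row)))
            for i in range(n_loc)
        ])
    return out


# Toy student feature: 2 channels, 3 spatial locations (illustrative values).
rng = random.Random(0)  # seeded so the perturbation is reproducible
student = [[1.0, 2.0, 3.0], [0.5, -0.5, 0.0]]
noisy = add_gaussian_noise(student, sigma=0.1, rng=rng)

identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder weights, not learned
generated = pointwise_generate(noisy, identity, biases=[0.0, 0.0])
print(generated)
```

In training, the generated feature would then be compared with the teacher's feature along the channel dimension, and the weights of the generation module would be learned jointly with the student.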