Relative Difficulty Distillation for Semantic Segmentation

Dong Liang, Yue Sun, Yun Du, Songcan Chen, Sheng-Jun Huang

TL;DR

This work tackles training instability in knowledge distillation for semantic segmentation by reframing knowledge as pixel-level relative difficulty and introducing Relative Difficulty Distillation (RDD). RDD operates in two stages, TFE-RDD and TSE-RDD, and uses prediction discrepancies to generate per-pixel difficulty maps that weight the task loss and steer learning without adding extra objective terms. Results on Cityscapes, CamVid, Pascal VOC, and ADE20k show that RDD consistently improves on state-of-the-art KD methods and can enhance existing distillation techniques with minimal training-time overhead. The approach offers a practical way to raise the performance ceiling of lightweight segmentation models while keeping training stable and remaining compatible with various backbones.

Abstract

Current knowledge distillation (KD) methods primarily focus on transferring various kinds of structured knowledge and designing corresponding optimization goals to encourage the student network to imitate the output of the teacher network. However, introducing too many additional optimization objectives may lead to unstable training, for example through gradient conflicts. Moreover, these methods ignore the guidance that the relative learning difficulty between the teacher and student networks can provide. Inspired by human cognitive science, in this paper we redefine knowledge from a new perspective: the relative difficulty of samples for the student and teacher networks. We propose a pixel-level KD paradigm for semantic segmentation named Relative Difficulty Distillation (RDD), organized as a two-stage framework: Teacher-Full Evaluated RDD (TFE-RDD) and Teacher-Student Evaluated RDD (TSE-RDD). RDD allows the teacher network to provide effective guidance on learning focus without additional optimization goals, which avoids tuning the weights of multiple losses. Extensive experimental evaluations using a general distillation loss function on popular datasets such as Cityscapes, CamVid, Pascal VOC, and ADE20k demonstrate the effectiveness of RDD against state-of-the-art KD methods. Additionally, we show that RDD can be integrated with existing KD methods to improve their upper performance bound.
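
The idea of steering the learning focus without an extra objective can be illustrated with a per-pixel weighted cross-entropy. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `weighted_pixel_ce`, the flattened list-of-pixels layout, and the normalization by the sum of weights are all hypothetical choices made for illustration.

```python
import math


def weighted_pixel_ce(probs, labels, weights):
    """Cross-entropy over pixels, each scaled by a difficulty weight.

    probs   : list of per-pixel class-probability lists (student softmax output)
    labels  : list of ground-truth class indices, one per pixel
    weights : list of non-negative per-pixel weights (a difficulty map)
    """
    eps = 1e-12  # guards against log(0) and division by zero
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        total += w * -math.log(p[y] + eps)
    # Assumption: normalize by the total weight so the loss scale stays
    # comparable to an unweighted mean; the paper may normalize differently.
    return total / (sum(weights) + eps)


if __name__ == "__main__":
    probs = [[0.5, 0.5], [0.9, 0.1]]
    labels = [0, 0]
    # A weight of zero removes the second pixel from the objective entirely.
    print(weighted_pixel_ce(probs, labels, [1.0, 0.0]))
    print(weighted_pixel_ce(probs, labels, [1.0, 1.0]))
```

Because the weights only rescale the existing task loss, there is still a single objective and no additional loss coefficient to tune, which is the property the abstract emphasizes.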

Paper Structure

This paper contains 25 sections, 19 equations, 6 figures, 9 tables, 1 algorithm.

Figures (6)

  • Figure 1: The prediction discrepancy based on confidence between the teacher and student networks is used to assess the learning difficulty of pixels. There are three situations in which the student and teacher networks evaluate the difficulty of a pixel: Case 1, where both networks evaluate the pixel as easy; Case 2, where both networks evaluate the pixel as difficult; Case 3, where there is disagreement on the difficulty of the pixel. Note that for Case 3, the pixel with prediction discrepancies is the difficult pixel that the student network should learn.
  • Figure 2: The proposed Relative Difficulty Distillation (RDD).
  • Figure 3: In the TFE-RDD stage, the difficulty map based on prediction discrepancy is obtained using the prediction results of primary and auxiliary classifiers of the teacher network, and the student network is guided to learn simple pixels for efficient fitting.
  • Figure 4: In the TSE-RDD stage, the difficulty maps based on confidence are obtained using the prediction results of the teacher and student networks. The filtered difficulty maps are applied to Exclusive-OR operations to obtain valuable difficult pixels and expand the upper performance bound.
  • Figure 5: Qualitative segmentation results on the validation set of Cityscapes using DeepLabV3-ResNet18 as the student network and DeepLabV3-ResNet101 as the teacher network: (a) Input image. (b) Ground truth. (c) Results of the original student network without KD. (d) Results of AT [52]. (e) Results of CIRKD [29]. (f) Results of DSD [74]. (g) Results of the proposed RDD. (h) Results of the teacher network.
  • ...and 1 more figure
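
The captions of Figures 1, 3, and 4 describe two kinds of difficulty maps, which can be sketched as follows. This is a rough illustration under stated assumptions rather than the paper's formulation: the function names, the use of total variation distance as the discrepancy measure, the choice of maximum class probability as confidence, and the threshold value are all hypothetical.

```python
def confidence(probs):
    """Per-pixel confidence, taken here as the maximum class probability."""
    return [max(p) for p in probs]


def tfe_weights(primary, auxiliary):
    """TFE-style weights from the teacher's primary and auxiliary classifiers.

    Assumption: discrepancy is the total variation distance between the two
    predicted distributions, and the weight is 1 - discrepancy, so pixels on
    which both classifiers agree (simple pixels) are emphasized.
    """
    weights = []
    for p, q in zip(primary, auxiliary):
        discrepancy = 0.5 * sum(abs(a - b) for a, b in zip(p, q))  # in [0, 1]
        weights.append(1.0 - discrepancy)
    return weights


def tse_mask(teacher, student, threshold=0.7):
    """TSE-style mask: pixels where teacher and student disagree on difficulty.

    Assumption: a pixel counts as difficult for a network when its confidence
    falls below `threshold`. The Exclusive-OR keeps only the pixels that
    exactly one of the two networks finds difficult (Case 3 in Figure 1).
    """
    t_hard = [c < threshold for c in confidence(teacher)]
    s_hard = [c < threshold for c in confidence(student)]
    return [t != s for t, s in zip(t_hard, s_hard)]


if __name__ == "__main__":
    teacher = [[0.9, 0.1], [0.55, 0.45], [0.95, 0.05]]
    student = [[0.8, 0.2], [0.6, 0.4], [0.6, 0.4]]
    # Pixel 0: easy for both; pixel 1: difficult for both; pixel 2: disagreement.
    print(tse_mask(teacher, student))
```

In this sketch only the third pixel is selected, since it is the one where the two networks disagree. Either map could then serve as the per-pixel weights applied to the task loss.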