Make a Strong Teacher with Label Assistance: A Novel Knowledge Distillation Approach for Semantic Segmentation

Shoumeng Qiu, Jie Chen, Xinrun Li, Ru Wan, Xiangyang Xue, Jian Pu

TL;DR

This paper tackles efficient semantic segmentation by proposing Label Assisted Distillation (LAD), which enables a lightweight teacher to leverage label information as an input channel through a Label Noising Module (LNM). A dual-path consistency training regime stabilizes the teacher's outputs despite random label perturbations, while the student is trained with standard distillation. Experiments across five challenging datasets and five backbone models show consistent mIoU gains, demonstrating the method's generality and practical viability, with code released for reproducibility. By reframing label information as privileged input and validating its transferability across model pairs, the work offers new insights into privileged-information-based knowledge distillation for segmentation and other vision tasks.

Abstract

In this paper, we introduce a novel knowledge distillation approach for the semantic segmentation task. Unlike previous methods that rely on powerful pre-trained teachers or other modalities to provide additional knowledge, our approach does not require complex teacher models or information from extra sensors. Specifically, for the teacher model training, we propose to noise the label and then incorporate it into the input to effectively boost the lightweight teacher's performance. To ensure the robustness of the teacher model against the introduced noise, we propose a dual-path consistency training strategy featuring a distance loss between the outputs of the two paths. For the student model training, we keep it consistent with standard distillation for simplicity. Our approach not only boosts the efficacy of knowledge distillation but also increases the flexibility in selecting teacher and student models. To demonstrate the advantages of our Label Assisted Distillation (LAD) method, we conduct extensive experiments on five challenging datasets (Cityscapes, ADE20K, PASCAL-VOC, COCO-Stuff 10K, and COCO-Stuff 164K) and five popular models (FCN, PSPNet, DeepLabV3, STDC, and OCRNet). The results show the effectiveness and generalization of our approach. We posit that incorporating labels into the input, as demonstrated in our work, will provide valuable insights into related fields. Code is available at https://github.com/skyshoumeng/Label_Assisted_Distillation.
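
The two training objectives described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the choice of mean squared error between softmax outputs as the "distance loss", the KL-divergence form of the distillation term, and the weights `lam`, `alpha`, and temperature `tau` are all assumptions made for illustration.

```python
import numpy as np

def softmax(logits, axis=0):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy(logits, label):
    """Mean pixel-wise cross-entropy. logits: (C, H, W), label: (H, W) ints."""
    probs = softmax(logits, axis=0)
    h, w = label.shape
    # Pick the predicted probability of the true class at every pixel.
    picked = probs[label, np.arange(h)[:, None], np.arange(w)[None, :]]
    return float(-np.log(picked + 1e-12).mean())

def teacher_loss(logits_a, logits_b, label, lam=1.0):
    """Dual-path teacher objective: both paths are supervised by the label,
    plus a consistency term between the two predictions.
    The MSE distance and the weight `lam` are assumptions."""
    seg = cross_entropy(logits_a, label) + cross_entropy(logits_b, label)
    consistency = float(np.mean((softmax(logits_a) - softmax(logits_b)) ** 2))
    return seg + lam * consistency

def student_loss(student_logits, teacher_logits, label, alpha=1.0, tau=1.0):
    """Standard distillation: segmentation loss plus a KL term toward the
    teacher. `alpha` and `tau` are assumed hyperparameters."""
    p_t = softmax(teacher_logits / tau)
    p_s = softmax(student_logits / tau)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=0)
    return cross_entropy(student_logits, label) + alpha * float(kl.mean())
```

In this sketch, `logits_a` and `logits_b` stand for the teacher's outputs on the same image paired with two independently noised copies of the label; when the two paths agree exactly, the consistency term vanishes and only the supervised terms remain.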

Paper Structure (16 sections, 10 equations, 4 figures, 9 tables)

Figures (4)

  • Figure 1: Comparison with two main categories of distillation approaches. S and T indicate the student and teacher models, $\mathcal{L}_{seg}$ denotes the segmentation task loss, and $\mathcal{L}_{kd}$ denotes the distillation loss. The additional knowledge mainly comes from: (a) a powerful pretrained teacher model; (b) a teacher that takes additional modalities as input; (c) a lightweight teacher model that takes the processed label as input. Our approach (c) shows a clear advantage, as it requires neither a complex teacher model nor other modalities.
  • Figure 2: The overall pipeline of our distillation framework. © denotes the concatenation operation along the channel dimension, and $\bigoplus$ denotes the addition operation. For the teacher model training, shown within the dashed box above, the Label Noising Module (LNM) and the teacher model are duplicated into two copies. The image and a noised label are then fed into each copy, and the outputs of both models are supervised by the label. Additionally, we integrate a consistency constraint loss between the two predictions. Only one branch of the trained teacher needs to be retained for distillation into the student model.
  • Figure 3: Details of the label noising module. $\bigotimes$ denotes the multiplication operation, and $\bigoplus$ denotes the addition operation. The label is first represented in one-hot encoding, then the channel of each class is multiplied by a weight obtained from random sampling. After that, the results are added along the channel dimension, which distorts the class index information. Subsequently, random noise is added to each pixel, yielding the final noised label, which is referred to as privileged information in this paper.
  • Figure 4: Experiment on the contribution of different inputs to the predictions. Here we choose the contribution to the class car for visualization, where red regions correspond to a high contribution to the class. We use LayerCAM (Jiang et al., 2021) for its better performance in lower layers. Best viewed zoomed in.
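
The label noising steps described in the Figure 3 caption (one-hot encoding, random per-class weighting, summation over channels, per-pixel noise) and the channel concatenation from Figure 2 can be sketched as follows. This is an illustrative sketch only: the uniform distribution for class weights, the Gaussian per-pixel noise, the `noise_std` parameter, and the helper names `noise_label` and `build_teacher_input` are assumptions, not details taken from the paper.

```python
import numpy as np

def noise_label(label, num_classes, noise_std=0.1, rng=None):
    """Turn an (H, W) integer label map into an (H, W) float noised label.

    Sampling distributions and noise_std are assumptions for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Step 1: one-hot encode the label -> (num_classes, H, W).
    # Pixels whose index is outside [0, num_classes) become all-zero.
    classes = np.arange(num_classes)[:, None, None]
    one_hot = (classes == label[None]).astype(np.float32)
    # Step 2: multiply each class channel by a randomly sampled weight.
    weights = rng.uniform(0.0, 1.0, size=(num_classes, 1, 1)).astype(np.float32)
    weighted = one_hot * weights
    # Step 3: add along the channel dimension, which hides the class index.
    collapsed = weighted.sum(axis=0)
    # Step 4: add random noise to every pixel.
    noise = rng.normal(0.0, noise_std, size=collapsed.shape).astype(np.float32)
    return collapsed + noise

def build_teacher_input(image, noised_label):
    """Concatenate a (C, H, W) image with the noised label along channels."""
    return np.concatenate([image, noised_label[None]], axis=0)
```

Because the class weights are resampled on every call, two calls on the same label give different noised maps, which is what the two teacher paths in Figure 2 would each receive under this sketch.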