Structural and Statistical Texture Knowledge Distillation for Semantic Segmentation

Deyi Ji, Haoran Wang, Mingyuan Tao, Jianqiang Huang, Xian-Sheng Hua, Hongtao Lu

TL;DR

The paper tackles semantic segmentation by addressing a gap in knowledge distillation: loss of low-level texture cues. It introduces SSTKD, combining a Contourlet Decomposition Module (CDM) for structural texture and a Denoised Texture Intensity Equalization Module (DTIEM) for statistical texture, both distilled from a teacher to a lighter student. The framework adds dedicated losses L_{str} and L_{sta} alongside standard response and adversarial losses within a PSPNet-based teacher–student setup, yielding state-of-the-art results on Cityscapes, Pascal VOC 2012, and ADE20K. This texture-centric KD approach improves boundary detail and intensity distribution while maintaining efficiency, making it practical for high-resolution semantic segmentation tasks.

Abstract

Existing knowledge distillation methods for semantic segmentation mainly focus on transferring high-level contextual knowledge from teacher to student. However, low-level texture knowledge is also vital for characterizing local structural patterns and global statistical properties, such as boundary, smoothness, regularity and color contrast, which may not be well captured by high-level deep features. In this paper, we aim to take full advantage of both structural and statistical texture knowledge and propose a novel Structural and Statistical Texture Knowledge Distillation (SSTKD) framework for semantic segmentation. Specifically, for structural texture knowledge, we introduce a Contourlet Decomposition Module (CDM) that decomposes low-level features with an iterative Laplacian pyramid and a directional filter bank to mine the structural texture knowledge. For statistical texture knowledge, we propose a Denoised Texture Intensity Equalization Module (DTIEM) to adaptively extract and enhance statistical texture knowledge through heuristic iterative quantization and a denoising operation. Finally, each type of knowledge is supervised by an individual loss function, forcing the student network to better mimic the teacher from a broader perspective. Experiments show that the proposed method achieves state-of-the-art performance on the Cityscapes, Pascal VOC 2012 and ADE20K datasets.

Paper Structure (13 sections, 10 equations, 5 figures, 6 tables)

This paper contains 13 sections, 10 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Overview of structural and statistical texture knowledge distillation on an example image. Both kinds of texture knowledge are extracted from the low-level features of a CNN backbone. The original structural and statistical textures are fuzzy and low in contrast. After distillation, the contours are clearer and the intensity contrast is more equalized, showing that both kinds of texture are enhanced.
  • Figure 2: An overview of our proposed framework. PSPNet [zhao2017pyramid] is used as the model architecture for both the teacher and the student network, and consists of the backbone network, the pyramid pooling module (PPM) and the final output map. Apart from the response knowledge, we further propose to extract texture knowledge from low-level features. The parts corresponding to the two kinds of texture knowledge are depicted in light red and light green below the network pipeline, respectively.
  • Figure 3: LP decomposition [do2003contourlets, do2002contourlets, c-cnn]. The low-pass subband $a$ is generated from the input $x$ with a low-pass analysis filter $H$ and a sampling matrix $S$. The high-pass subband $b$ is then computed as the difference between $x$ and the prediction of $a$, which is obtained with a sampling matrix $S$ followed by a low-pass synthesis filter $G$.
  • Figure 4: Visual comparison of the low-level features from stage 1 of the backbone. KD means knowledge distillation. (a) is the original image, (b) is from the student network without texture knowledge distillation, and (c) shows the changes after applying it in our method. Rows 1 and 3 show the structural texture, while rows 2 and 4 show the statistical texture.
  • Figure 5: Visual improvements on the Cityscapes dataset: (a) original images, (b) w/o distillation, (c) our distillation method, (d) ground truth. Our method improves on the student network w/o distillation and produces more accurate and detailed results, which are circled by dotted lines.
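
The Laplacian pyramid (LP) step described in the Figure 3 caption can be illustrated with a minimal NumPy sketch. This is not the paper's CDM implementation. It assumes a single decomposition level, a dyadic sampling matrix $S$ (keep every second row and column), and the same 5-tap binomial kernel for both the analysis filter $H$ and the synthesis filter $G$. The names `lowpass`, `lp_decompose`, `lp_reconstruct` and `KERNEL` are hypothetical and chosen only for this example.

```python
import numpy as np

# Assumed 5-tap binomial low-pass kernel, used here for both H and G.
KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0


def lowpass(x, kernel):
    """Separable low-pass filter with reflect padding; output has the shape of x."""
    pad = len(kernel) // 2
    out = np.pad(x, pad, mode="reflect")
    # Filter along rows, then along columns ("valid" removes the padding again).
    out = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, out)
    out = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, out)
    return out


def predict(a, shape, kernel):
    """Upsample a by zero insertion, then apply the synthesis filter G."""
    up = np.zeros(shape, dtype=a.dtype)
    up[::2, ::2] = a
    # Gain of 4 compensates for the zeros inserted in both dimensions.
    return 4.0 * lowpass(up, kernel)


def lp_decompose(x, kernel):
    """One LP level: low-pass subband a and high-pass residual b."""
    a = lowpass(x, kernel)[::2, ::2]        # analysis filter H, then downsampling S
    b = x - predict(a, x.shape, kernel)     # difference between x and the prediction of a
    return a, b


def lp_reconstruct(a, b, kernel):
    """Invert lp_decompose by adding the prediction back to the residual."""
    return b + predict(a, b.shape, kernel)


rng = np.random.default_rng(0)
x = rng.random((8, 8))
a, b = lp_decompose(x, KERNEL)
x_rec = lp_reconstruct(a, b, KERNEL)
print(a.shape, b.shape, np.allclose(x, x_rec))
```

Because $b$ stores exactly what the prediction from $a$ misses, adding the prediction back recovers $x$. Applying `lp_decompose` again to `a` would give the next pyramid level, which is the sense in which the decomposition is iterative. The directional filter bank that the CDM applies to the high-pass subbands is not covered by this sketch.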