A Deep Semantic Segmentation Network with Semantic and Contextual Refinements

Zhiyan Wang, Deyin Liu, Lin Yuanbo Wu, Song Wang, Xin Guo, Lin Qi

TL;DR

The paper tackles semantic segmentation by addressing two core challenges: misalignment from downsampling and the need for global context. It introduces the Semantic Refinement Module (SRM), which learns neighbor-aware per-pixel offsets guided by high-resolution features, and the Contextual Refinement Module (CRM), which aggregates multi-stage features and applies sequential channel–spatial attention to capture global context. Together, SRM and CRM yield improved boundary delineation and richer context modeling, achieving state-of-the-art results on Cityscapes, Bdd100K, and ADE20K, including a lightweight network reaching $82.5\%$ mIoU with $137.9$ GFLOPs. The approach demonstrates strong accuracy with efficient computation, making it practical for real-time or resource-constrained semantic segmentation systems.

Abstract

Semantic segmentation is a fundamental task in multimedia processing, used for analyzing, understanding, and editing the contents of images and videos, among other applications. To accelerate the analysis of multimedia data, existing segmentation research tends to extract semantic information by progressively reducing the spatial resolution of feature maps. However, this approach introduces a misalignment problem when restoring the resolution of high-level feature maps. In this paper, we design a Semantic Refinement Module (SRM) to address this issue within the segmentation network. Specifically, SRM learns a transformation offset for each pixel in the upsampled feature maps, guided by high-resolution feature maps and neighboring offsets. By applying these offsets to the upsampled feature maps, SRM enhances the semantic representation of the segmentation network, particularly for pixels around object boundaries. Furthermore, a Contextual Refinement Module (CRM) is presented to capture global context information across both spatial and channel dimensions. To balance the channel and spatial dimensions, we aggregate the semantic maps from all four stages of the backbone to enrich channel context information. The efficacy of the proposed modules is validated on three widely used datasets (Cityscapes, Bdd100K, and ADE20K), demonstrating superior performance compared to state-of-the-art methods. Additionally, this paper extends these modules to a lightweight segmentation network, achieving an mIoU of 82.5% on the Cityscapes validation set with only 137.9 GFLOPs.
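To make the idea of per-pixel offsets on an upsampled feature map concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It upsamples a tiny single-channel map bilinearly and then resamples it at positions shifted by per-pixel offsets. The function names (`bilinear_sample`, `upsample_bilinear`, `warp_with_offsets`), the half-pixel coordinate mapping, and the border clamping are all assumptions made for illustration. In the paper the offsets are learned by the network; here they are set by hand.

```python
import math

def bilinear_sample(feat, y, x):
    """Sample a 2D grid at fractional (y, x), clamping to the border."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def upsample_bilinear(feat, scale):
    """Plain bilinear upsampling with a half-pixel coordinate mapping (assumed)."""
    h, w = len(feat), len(feat[0])
    out = []
    for i in range(h * scale):
        sy = (i + 0.5) / scale - 0.5
        row = []
        for j in range(w * scale):
            sx = (j + 0.5) / scale - 0.5
            row.append(bilinear_sample(feat, sy, sx))
        out.append(row)
    return out

def warp_with_offsets(feat, offsets):
    """Resample feat at (i + dy, j + dx), where offsets[i][j] = (dy, dx)."""
    h, w = len(feat), len(feat[0])
    return [
        [bilinear_sample(feat, i + offsets[i][j][0], j + offsets[i][j][1])
         for j in range(w)]
        for i in range(h)
    ]

# Hypothetical 1x2 low-resolution map, upsampled by a factor of 2.
low_res = [[0.0, 1.0]]
up = upsample_bilinear(low_res, 2)  # each row is [0.0, 0.25, 0.75, 1.0]

# Hand-set offsets (learned in the paper): shift only pixel (0, 1) one step right.
offsets = [[(0.0, 0.0) for _ in row] for row in up]
offsets[0][1] = (0.0, 1.0)
refined = warp_with_offsets(up, offsets)
```

With all offsets at zero the warp returns the bilinear result unchanged, so the offsets act purely as a correction on top of ordinary upsampling.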

Paper Structure

This paper contains 25 sections, 12 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Comparison of Bilinear upsampling and Learnable offset-based upsampling for a Low-Resolution (LR) feature map.
  • Figure 2: Some visualization comparisons on the Cityscapes dataset. (a) w/o mask layer, (b) w/ mask layer. The mask layer exploits the contribution of the neighbors' offsets to the ultimate offset for each pixel. The first column displays the outputs of the method without the "mask" layer. The second column indicates the outputs of the method with the "mask" layer. It is observed that the object boundaries generated by the "mask" layer are clearer, such as the "truck" in the first row and the "traffic sign" in the second row. In the third row, the regions of "sidewalk" and "road" are classified more accurately when using the method with the "mask" layer.
  • Figure 3: Comparison of the average pooling strategy and attention mechanism. (a) Average pooling strategy captures the contexts by taking the average of all pixels within the pooling region, overlooking the fact that different pixels may make unequal contributions. The pooling region for global average pooling encompasses the entire feature map. (b) Attention mechanism adaptively captures global contexts by calculating the response at a position through a weighted aggregation of features from all positions.
  • Figure 4: The structure of the proposed method. The Contextual Refinement Module takes the features from all four stages of the backbone as inputs to extract semantic features. The Semantic Refinement Module mitigates the feature misalignment problem during upsampling operations.
  • Figure 5: Illustration of the Semantic Refinement Module. SRM first takes adjacent low-resolution and high-resolution features as inputs to predict an initial transformation offset and a weight mask $M_w$. The final offset map is a weighted combination of each pixel's neighborhood on the initial offset map, with the weights given by the mask.
  • ...and 4 more figures
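
The neighbor-weighted offset combination described in the Figure 5 caption can be sketched as follows. This is an illustrative pure-Python sketch under stated assumptions, not the paper's code: the 3×3 neighborhood size, the softmax normalization of the mask, the replicate-border handling, and the names `softmax` and `refine_offsets` are all hypothetical choices. The caption only states that the final offset is a weighted combination of neighboring initial offsets using the mask $M_w$.

```python
import math

def softmax(values):
    """Numerically stable softmax over a flat list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def refine_offsets(init_offsets, mask_logits, k=3):
    """Combine each pixel's k x k neighborhood of initial offsets.

    init_offsets[i][j] is a (dy, dx) pair.
    mask_logits[i][j] is a list of k*k raw scores for that pixel
    (normalizing them with softmax is an assumption of this sketch).
    """
    h, w = len(init_offsets), len(init_offsets[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            weights = softmax(mask_logits[i][j])
            dy = dx = 0.0
            n = 0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    # Replicate the border for neighbors outside the map.
                    ni = min(max(i + di, 0), h - 1)
                    nj = min(max(j + dj, 0), w - 1)
                    dy += weights[n] * init_offsets[ni][nj][0]
                    dx += weights[n] * init_offsets[ni][nj][1]
                    n += 1
            row.append((dy, dx))
        out.append(row)
    return out

# Hypothetical 1x3 initial offset map with one outlier in the middle.
init = [[(0.0, 0.0), (0.0, 3.0), (0.0, 0.0)]]
uniform = [[[0.0] * 9 for _ in range(3)]]  # equal weight for every neighbor
smoothed = refine_offsets(init, uniform)

# A mask sharply peaked on the center entry (index 4) keeps the initial offset.
peaked = [[[50.0 if n == 4 else 0.0 for n in range(9)] for _ in range(3)]]
kept = refine_offsets(init, peaked)
```

Under these assumptions, a uniform mask averages the neighborhood and smooths out isolated offsets, while a mask concentrated on the center leaves the initial offset nearly unchanged. A learned mask can sit anywhere between those two extremes for each pixel.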