Sparse Refinement for Efficient High-Resolution Semantic Segmentation

Zhijian Liu; Zhuoyang Zhang; Samir Khaki; Shang Yang; Haotian Tang; Chenfeng Xu; Kurt Keutzer; Song Han

Sparse Refinement for Efficient High-Resolution Semantic Segmentation

Zhijian Liu, Zhuoyang Zhang, Samir Khaki, Shang Yang, Haotian Tang, Chenfeng Xu, Kurt Keutzer, Song Han

TL;DR

SparseRefine tackles the latency of high-resolution semantic segmentation by combining dense low-resolution predictions with sparse high-resolution refinements. It uses an entropy-based pixel selector to choose a small, uncertain subset of pixels, refines them with a sparse feature extractor operating on high-resolution input, and fuses the refinements with a gated ensembler. The gating mechanism is defined by $w = \text{sigmoid}(g([y_1; y_2; e_1; e_2]))$ and the final prediction is $y = f(w \cdot y_1 + (1 - w) \cdot y_2)$, enabling principled fusion of dense and sparse information. The approach is architecture-agnostic and achieves 1.5× to 3.7× speedups on multiple backbones (e.g., HRNet-W48, SegFormer-B5, Mask2Former, SegNeXt-L) on Cityscapes with negligible accuracy loss, while generalizing across datasets such as Pascal VOC, BDD100K, DeepGlobe, and ISIC. This dense-sparse paradigm enables practical deployment of high-resolution semantic segmentation in latency-sensitive applications.

Abstract

Semantic segmentation empowers numerous real-world applications, such as autonomous driving and augmented/mixed reality. These applications often operate on high-resolution images (e.g., 8 megapixels) to capture the fine details. However, this comes at the cost of considerable computational complexity, hindering the deployment in latency-sensitive scenarios. In this paper, we introduce SparseRefine, a novel approach that enhances dense low-resolution predictions with sparse high-resolution refinements. Based on coarse low-resolution outputs, SparseRefine first uses an entropy selector to identify a sparse set of pixels with high entropy. It then employs a sparse feature extractor to efficiently generate the refinements for those pixels of interest. Finally, it leverages a gated ensembler to apply these sparse refinements to the initial coarse predictions. SparseRefine can be seamlessly integrated into any existing semantic segmentation model, regardless of CNN- or ViT-based. SparseRefine achieves significant speedup: 1.5 to 3.7 times when applied to HRNet-W48, SegFormer-B5, Mask2Former-T/L and SegNeXt-L on Cityscapes, with negligible to no loss of accuracy. Our "dense+sparse" paradigm paves the way for efficient high-resolution visual computing.

Sparse Refinement for Efficient High-Resolution Semantic Segmentation

TL;DR

and the final prediction is

, enabling principled fusion of dense and sparse information. The approach is architecture-agnostic and achieves 1.5× to 3.7× speedups on multiple backbones (e.g., HRNet-W48, SegFormer-B5, Mask2Former, SegNeXt-L) on Cityscapes with negligible accuracy loss, while generalizing across datasets such as Pascal VOC, BDD100K, DeepGlobe, and ISIC. This dense-sparse paradigm enables practical deployment of high-resolution semantic segmentation in latency-sensitive applications.

Abstract

Paper Structure (32 sections, 1 equation, 4 figures, 8 tables)

This paper contains 32 sections, 1 equation, 4 figures, 8 tables.

Introduction
Related Work
Semantic Segmentation
Activation Sparsity
Mask Refinement
Multi-Scale Models
SparseRefine
Dense Low-Resolution Prediction
Sparse High-Resolution Refinement
Entropy Selector.
Sparse Feature Extractor.
Gated Ensembler.
Experiments
Setup
Dataset.
...and 17 more sections

Figures (4)

Figure 1: Processing dense high-resolution inputs is computationally expensive. In this paper, we propose an alternative approach by integrating dense low-resolution and sparse high-resolution inputs, which provide complementary information about the overall scene layout and intricate object details. Leveraging the lower resolution and sparsity of these inputs allows for more efficient processing.
Figure 2: SparseRefine improves initial dense low-resolution predictions with sparse high-resolution refinements. It first performs the dense low-resolution inference on the downsampled image to obtain the initial prediction. Subsequently, it uses an entropy selector to identify a sparse set of pixels with high entropy, and then employs a sparse feature extractor to efficiently generate refinements for those selected pixels. Afterwards, it applies these sparse refinements to the initial predictions with a gated ensembler.
Figure 3: The entropy map exhibits a strong correlation with the error map (a). Recall rates (b) and precision rates (c) for the entropy selector, magnitude selector, and learnable selector.
Figure 4: SparseRefine improves the low-resolution (LR) baseline with substantially better recognition of small, distant objects and finer detail around object boundaries.

Sparse Refinement for Efficient High-Resolution Semantic Segmentation

TL;DR

Abstract

Sparse Refinement for Efficient High-Resolution Semantic Segmentation

Authors

TL;DR

Abstract

Table of Contents

Figures (4)