SegMAN: Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation

Yunxiang Fu, Meng Lou, Yizhou Yu

TL;DR

SegMAN targets three capabilities that semantic segmentation models rarely combine: global context modeling, local detail encoding, and multi-scale feature extraction. It introduces a SegMAN Encoder that fuses local attention with dynamic state space models (SS2D) via a LASS token mixer, and a decoder whose MMSCopE module adaptively extracts multi-scale context using SS2D scans. SegMAN-B reaches 52.6% mIoU on ADE20K and 83.8% mIoU on Cityscapes and outperforms VWFormer-B3 on COCO-Stuff, in each case with fewer GFLOPs than the compared baseline, and its components are reported to be plug-and-play compatible with existing backbones and decoders. The result is an efficient, linear-time framework for omni-scale context modeling in dense prediction tasks.

Abstract

High-quality semantic segmentation relies on three key capabilities: global context modeling, local detail encoding, and multi-scale feature extraction. However, recent methods struggle to possess all these capabilities simultaneously. Hence, we aim to empower segmentation networks to simultaneously carry out efficient global context modeling, high-quality local detail encoding, and rich multi-scale feature representation for varying input resolutions. In this paper, we introduce SegMAN, a novel linear-time model comprising a hybrid feature encoder dubbed SegMAN Encoder, and a decoder based on state space models. Specifically, the SegMAN Encoder synergistically integrates sliding local attention with dynamic state space models, enabling highly efficient global context modeling while preserving fine-grained local details. Meanwhile, the MMSCopE module in our decoder enhances multi-scale context feature extraction and adaptively scales with the input resolution. Our SegMAN-B Encoder achieves 85.1% ImageNet-1k accuracy (+1.5% over VMamba-S with fewer parameters). When paired with our decoder, the full SegMAN-B model achieves 52.6% mIoU on ADE20K (+1.6% over SegNeXt-L with 15% fewer GFLOPs), 83.8% mIoU on Cityscapes (+2.1% over SegFormer-B3 with half the GFLOPs), and 1.6% higher mIoU than VWFormer-B3 on COCO-Stuff with lower GFLOPs. Our code is available at https://github.com/yunxiangfu2001/SegMAN.
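The results above are reported in mIoU (mean intersection-over-union). As a reminder of what that number measures, here is a minimal sketch in plain Python. It is not taken from the SegMAN code base: the function name `mean_iou` and the flattened label-list inputs are assumptions made for illustration, and real benchmark evaluation typically accumulates the confusion matrix over the whole validation set and skips an "ignore" label.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over the classes that occur in pred or target (illustrative helper)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    # confusion[t][p] counts pixels with ground-truth class t predicted as class p.
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        confusion[t][p] += 1

    ious = []
    for c in range(num_classes):
        tp = confusion[c][c]
        fp = sum(confusion[t][c] for t in range(num_classes)) - tp
        fn = sum(confusion[c]) - tp
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(tp / union)

    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Four pixels, two classes: class 0 has IoU 1/2, class 1 has IoU 2/3.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

Because the per-class IoUs are averaged with equal weight, a rare class counts as much as a frequent one, which is why gains on fine-grained local detail can move mIoU noticeably.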

Paper Structure

This paper contains 18 sections, 6 figures, 13 tables.

Figures (6)

  • Figure 1: SegMAN Encoder classification performance compared with representative vision backbones alongside semantic segmentation results of the full SegMAN model compared to prior state-of-the-art models.
  • Figure 2: Qualitative analysis of receptive field patterns and segmentation performance for small-sized models (27M-29M parameters). Left: Visualization of effective receptive fields (ERF) on the Cityscapes validation set at 1024×1024 resolution, illustrating SegMAN's stronger global context modeling capacity in comparison to existing state-of-the-art models. Right: Segmentation maps highlighting SegMAN's superior capacity to encode fine-grained local details that are often missed by existing approaches.
  • Figure 3: Overall architecture of SegMAN. (a) Hierarchical SegMAN Encoder. (b) LASS for modeling global contexts and local details with linear complexity. (c) The SegMAN Decoder. (d) The MMSCopE module for multi-scale context extraction.
  • Figure 4: Qualitative results on ADE20K. Zoom in for best view.
  • Figure 5: Qualitative results on Cityscapes. Zoom in for best view.
  • ...and 1 more figure