S$^2$-FPN: Scale-aware Strip Attention Guided Feature Pyramid Network for Real-time Semantic Segmentation

Mohammed A. M. Elhassan, Chenhui Yang, Chenxi Huang, Tewodros Legesse Munea, Xin Hong, Abuzar B. M. Adam, Amina Benabid

TL;DR

This work addresses real-time semantic segmentation by balancing accuracy and speed through a lightweight architecture. It introduces S2-FPN, which combines three modules: SSAM for scale-aware, vertical-strip attention; APF for discriminative multi-scale fusion; and GFU for feature upsampling and fusion. Ablations and benchmarks on Cityscapes and CamVid show strong accuracy at high frame rates, e.g., 77.4% mIoU at 67 FPS on Cityscapes and 69.6% to 74.2% mIoU on CamVid, depending on the model setting. The contributions offer a practical path to deploying high-quality segmentation in real-time systems such as autonomous driving, by reducing computational overhead while preserving contextual detail.
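
The two headline quantities, mIoU and FPS, can be made concrete with a small sketch. It converts the frame rates reported in the abstract into per-frame time budgets and computes mIoU from a confusion matrix using the standard definition. The helper names `frame_time_ms` and `mean_iou` and the toy confusion matrix are illustrative assumptions, not part of the paper or its code release.

```python
def frame_time_ms(fps):
    """Per-frame time budget in milliseconds for a given frame rate."""
    return 1000.0 / fps


def mean_iou(confusion):
    """Mean intersection-over-union from a square confusion matrix.

    confusion[i][j] counts pixels of true class i predicted as class j.
    Classes absent from both prediction and ground truth are skipped.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n)) - tp
        union = tp + fp + fn
        if union > 0:
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


# Cityscapes frame rates quoted in the abstract, converted to ms per frame.
for fps in (87.3, 67.0, 30.5):
    print(f"{fps:5.1f} FPS -> {frame_time_ms(fps):5.1f} ms/frame")

# Toy 2-class confusion matrix (made-up numbers, for illustration only).
toy = [[8, 2],
       [1, 9]]
print(f"toy mIoU = {mean_iou(toy):.3f}")
```

Read this way, the fastest reported setting leaves roughly 11 to 12 ms per frame, while the most accurate one takes roughly 33 ms.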

Abstract

Modern high-performance semantic segmentation methods employ a heavy backbone and dilated convolution to extract relevant features. Although extracting features with both contextual and semantic information is critical for segmentation tasks, it incurs a large memory footprint and high computation cost, which is a problem for real-time applications. This paper presents a new model that achieves a trade-off between accuracy and speed for real-time road scene semantic segmentation. Specifically, we propose a lightweight model named Scale-aware Strip Attention Guided Feature Pyramid Network (S$^2$-FPN). Our network consists of three main modules: the Attention Pyramid Fusion (APF) module, the Scale-aware Strip Attention Module (SSAM), and the Global Feature Upsample (GFU) module. APF adopts an attention mechanism to learn discriminative multi-scale features and help close the semantic gap between different levels. APF uses scale-aware attention to encode global context with a vertical striping operation and to model long-range dependencies, which helps relate pixels with similar semantic labels. In addition, APF employs a channel-wise reweighting block (CRB) to emphasize the channel features. Finally, the decoder of S$^2$-FPN adopts GFU, which fuses features from APF and the encoder. Extensive experiments on two challenging semantic segmentation benchmarks demonstrate that our approach achieves a better accuracy/speed trade-off across different model settings. The proposed models achieve 76.2% mIoU at 87.3 FPS, 77.4% mIoU at 67 FPS, and 77.8% mIoU at 30.5 FPS on the Cityscapes dataset, and 69.6% mIoU, 71.0% mIoU, and 74.2% mIoU on the CamVid dataset. The code for this work will be made available at https://github.com/mohamedac29/S2-FPN
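
To give a rough feel for the strip attention and channel reweighting ideas mentioned in the abstract, here is a minimal pure-Python sketch. It is not the authors' SSAM or CRB. It assumes the strip operation collapses the width axis so that each image row gets one descriptor (the abstract does not say which axis is pooled), and it replaces all learned layers with a fixed sigmoid gate. Every function name here is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def strip_pool_rows(feat):
    """Collapse the width axis: feat[c][h][w] -> desc[c][h] (mean over w)."""
    return [[sum(row) / len(row) for row in channel] for channel in feat]


def row_attention(feat):
    """Gate every row with a weight derived from its strip descriptor.

    Assumption: the gate is a plain sigmoid of the channel-averaged
    descriptor; a trained module would use learned layers instead.
    """
    desc = strip_pool_rows(feat)
    n_ch, n_rows = len(desc), len(desc[0])
    gates = [sigmoid(sum(desc[c][h] for c in range(n_ch)) / n_ch)
             for h in range(n_rows)]
    out = [[[v * gates[h] for v in channel[h]] for h in range(n_rows)]
           for channel in feat]
    return out, gates


def channel_reweight(feat):
    """Scale each channel by a sigmoid of its global average.

    Hypothetical stand-in for a channel-wise reweighting block.
    """
    out = []
    for channel in feat:
        n = sum(len(row) for row in channel)
        mean = sum(sum(row) for row in channel) / n
        g = sigmoid(mean)
        out.append([[v * g for v in row] for row in channel])
    return out


# Toy input: 2 channels, 3 rows, 4 columns (values are arbitrary).
feat = [
    [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], [-1.0, -1.0, -1.0, -1.0]],
    [[2.0, 2.0, 2.0, 2.0], [0.0, 0.0, 0.0, 0.0], [-2.0, -2.0, -2.0, -2.0]],
]
attended, gates = row_attention(feat)
refined = channel_reweight(attended)
print("row gates:", [round(g, 3) for g in gates])
```

The point of the strip step is cost: the attention is computed from C×H descriptors rather than C×H×W activations, yet each gate still summarizes an entire row, which is one inexpensive way to inject long-range context.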

Paper Structure (19 sections, 11 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Accuracy/speed performance comparison on the Cityscapes test set. Our methods are shown as red dots and other methods as blue dots. Our approaches achieve a state-of-the-art speed-accuracy trade-off.
  • Figure 2: A comparison of important semantic segmentation architectures: (a) encoder-decoder model, (b) two-pathway model, (c) feature pyramid model, (d) scale-aware feature pyramid model (ours).
  • Figure 3: The detailed architecture of the proposed Scale-aware Strip Attention Guided Pyramid Fusion model (S2-FPN). The model is built from the following modules: (a) Encoder, which incorporates ResNet18 or ResNet34, (b) Attention Pyramid Fusion module, (c) Global Feature Upsample (GFU) module, (d) Feature Adaptation Block (FAB), and (e) components of the Coarse Feature Generator block.
  • Figure 4: Illustration of the Scale-aware Strip Attention Module (SSAM).
  • Figure 5: An overview of the Attention Pyramid Fusion Module. (a) APF module architecture. (b) Components of the channel attention module (CAM). (c) Components of the feature refinement block (FRB).
  • ...and 3 more figures