
MacFormer: Semantic Segmentation with Fine Object Boundaries

Guoan Xu, Wenfeng Huang, Tao Wu, Ligeng Chen, Wenjing Jia, Guangwei Gao, Xiatian Zhu, Stuart Perry

TL;DR

MacFormer tackles the boundary prediction challenge in semantic segmentation by introducing two innovations: Mutual Agent Cross-Attention (MACA), which enables bidirectional feature exchange between encoder and decoder with controllable complexity via agent tokens, and a Frequency Enhancement Module (FEM), which leverages high- and low-frequency components to preserve boundary details. The approach is backbone-agnostic and achieves strong accuracy–efficiency trade-offs across Cityscapes and ADE20K, outperforming several state-of-the-art methods on multiple backbones and compute budgets. The combination of spatial mutual attention and frequency-domain refinement yields sharper boundaries and better small-object segmentation, demonstrating practical impact for dense prediction tasks. Overall, MacFormer offers a robust, flexible solution that integrates efficiently with existing architectures to enhance boundary-aware semantic segmentation.

Abstract

Semantic segmentation involves assigning a specific category to each pixel in an image. While Vision Transformer-based models have made significant progress, current semantic segmentation methods often struggle with precise predictions in localized areas like object boundaries. To tackle this challenge, we introduce a new semantic segmentation architecture, "MacFormer", which features two key components. Firstly, using learnable agent tokens, a Mutual Agent Cross-Attention (MACA) mechanism effectively facilitates the bidirectional integration of features across encoder and decoder layers. This enables better preservation of low-level features, such as elementary edges, during decoding. Secondly, a Frequency Enhancement Module (FEM) in the decoder leverages high-frequency and low-frequency components to boost features in the frequency domain, benefiting object boundaries with minimal computational complexity increase. MacFormer is demonstrated to be compatible with various network architectures and outperforms existing methods in both accuracy and efficiency on benchmark datasets ADE20K and Cityscapes under different computational constraints.
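
To make the role of agent tokens more concrete, here is a minimal NumPy sketch of bidirectional, agent-mediated cross-attention. It is an illustration under stated assumptions, not the authors' MACA implementation: the function names (`agent_cross_attention`, `mutual_agent_cross_attention`), the two-step "agents summarize, then queries read from agents" formulation, the residual additions, and the omission of learned projections are all assumptions made for brevity. What it does show is why the cost is controllable: with `n` agent tokens, each direction scales with `N * n` rather than with the product of the two token counts.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def agent_cross_attention(queries, context, agents):
    """Hypothetical agent-mediated attention.

    queries: (Nq, d), context: (Nc, d), agents: (n, d) -> output (Nq, d).
    Learned Q/K/V projections are omitted (assumption, for brevity).
    """
    scale = 1.0 / np.sqrt(queries.shape[-1])
    # Step 1: each agent token summarizes the context, cost O(n * Nc * d).
    agent_summary = softmax(agents @ context.T * scale) @ context
    # Step 2: each query reads from the n agent summaries, cost O(Nq * n * d).
    return softmax(queries @ agents.T * scale) @ agent_summary

def mutual_agent_cross_attention(enc, dec, agents):
    # Bidirectional exchange: encoder tokens attend to decoder tokens and
    # vice versa, both through the same small set of agent tokens.
    enc_out = enc + agent_cross_attention(enc, dec, agents)
    dec_out = dec + agent_cross_attention(dec, enc, agents)
    return enc_out, dec_out

rng = np.random.default_rng(0)
enc = rng.standard_normal((64, 16))    # 64 encoder tokens, width 16
dec = rng.standard_normal((16, 16))    # 16 decoder tokens, width 16
agents = rng.standard_normal((4, 16))  # 4 agent tokens (learnable in practice)
enc_out, dec_out = mutual_agent_cross_attention(enc, dec, agents)
print(enc_out.shape, dec_out.shape)    # (64, 16) (16, 16)
```

In this sketch, changing the first dimension of `agents` is the only knob needed to trade accuracy against compute; the output shapes are unaffected.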

Paper Structure (27 sections, 19 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Addressing the challenge of accurate segmentation at object boundaries, which often faces interference from neighboring categories, our MacFormer offers a promising solution. Visual comparison to SegFormer (Xie et al., 2021) showcases our method's superior performance in segmenting detailed features, particularly at the edges of individuals and vehicles.
  • Figure 2: The architecture of our proposed MacFormer consists of several key components. Initially, multi-scale feature maps are obtained in the Encoder. Subsequently, the Decoder incorporates the Mutual Agent Cross-Attention (MACA) mechanism and Frequency Enhancement Module (FEM) to improve the encoder features. Finally, the enhanced features are aggregated and forwarded to the segmentation head to produce the ultimate prediction.
  • Figure 3: The proposed Mutual Agent Cross-Attention (MACA) mechanism performs cross-operations among tokens from various features together with the agent tokens, denoted $A$, resulting in enhanced performance. The computational complexity can be managed by adjusting the dimension of $A$.
  • Figure 4: The illustration of the proposed Frequency Enhancement Module (FEM). The advantage of $E_1$ lies in its capability of capturing detailed high-frequency information, while $E_i$ excels at retaining low-frequency semantic context.
  • Figure 5: Visualization of feature maps. The first row depicts the output feature maps from different stages of the Encoder. The second row illustrates the visualization of feature maps obtained after incorporating frequency domain information using FEM, where mutual supplementation can be observed. Through this supplementation, both detailed and semantic information are enhanced and fused. $F$ represents the result obtained after concatenating the three features, $F_{12}$, $F_{13}$ and $F_{14}$.
  • ...and 2 more figures
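
The Figure 4 caption contrasts high-frequency detail with low-frequency semantic context. As a rough illustration of that distinction only, the sketch below splits a 2D feature map into low- and high-frequency parts with an FFT mask. It does not reproduce the FEM itself, whose internals are not described in this excerpt; the function name `split_frequencies`, the circular mask, and the `radius` parameter are assumptions chosen for the example.

```python
import numpy as np

def split_frequencies(feature, radius):
    """Split a (H, W) map into low- and high-frequency parts (illustrative).

    The low part keeps Fourier coefficients within `radius` of the zero
    frequency; the high part is the remainder, so low + high == feature.
    """
    h, w = feature.shape
    # Move the zero-frequency coefficient to the center index (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(feature))
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= radius
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = feature - low
    return low, high

# Toy "feature map": a vertical step edge between columns 15 and 16.
feature = np.zeros((32, 32))
feature[:, 16:] = 1.0
low, high = split_frequencies(feature, radius=4)

# Mean high-frequency magnitude per column; it is largest next to the edge
# (and next to the wrap-around edge, since the FFT treats the map as periodic).
edge_energy = np.abs(high).mean(axis=0)
print(np.round(edge_energy[[8, 15, 16]], 3))
```

The high-frequency part concentrates around the step, which is the kind of signal a boundary-oriented module would want to preserve, while the low-frequency part carries the smooth region-level content.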