MacFormer: Semantic Segmentation with Fine Object Boundaries
Guoan Xu, Wenfeng Huang, Tao Wu, Ligeng Chen, Wenjing Jia, Guangwei Gao, Xiatian Zhu, Stuart Perry
TL;DR
MacFormer targets imprecise predictions at object boundaries in semantic segmentation with two components: Mutual Agent Cross-Attention (MACA), which exchanges features bidirectionally between encoder and decoder and keeps complexity controllable through agent tokens, and a Frequency Enhancement Module (FEM), which uses high- and low-frequency components to preserve boundary details. The approach is backbone-agnostic, integrates with existing architectures, and achieves strong accuracy–efficiency trade-offs on Cityscapes and ADE20K, outperforming several state-of-the-art methods across multiple backbones and compute budgets. Combining spatial mutual attention with frequency-domain refinement yields sharper boundaries and better small-object segmentation.
Abstract
Semantic segmentation assigns a category to each pixel in an image. While Vision Transformer-based models have made significant progress, current semantic segmentation methods often struggle to make precise predictions in localized areas such as object boundaries. To tackle this challenge, we introduce a new semantic segmentation architecture, "MacFormer", which features two key components. First, a Mutual Agent Cross-Attention (MACA) mechanism uses learnable agent tokens to integrate features bidirectionally across encoder and decoder layers. This better preserves low-level features, such as elementary edges, during decoding. Second, a Frequency Enhancement Module (FEM) in the decoder uses high-frequency and low-frequency components to boost features in the frequency domain, benefiting object boundaries with minimal additional computational cost. MacFormer is shown to be compatible with various network architectures and outperforms existing methods in both accuracy and efficiency on the benchmark datasets ADE20K and Cityscapes under different computational constraints.
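
To make the agent-token idea concrete, the following is a minimal NumPy sketch of generic agent-mediated cross-attention applied in both directions. It is an illustration under stated assumptions, not the MACA implementation from the paper: the function names, the residual connections, the sharing of one agent set between both directions, and the omission of learned projection matrices are all hypothetical simplifications. The point it shows is that routing attention through n agent tokens costs on the order of (N_q + N_c) · n · d operations rather than the N_q · N_c · d of full cross-attention, so n controls the complexity.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def agent_cross_attention(queries, context, agents):
    """Cross-attention routed through a small set of agent tokens.

    queries: (Nq, d), context: (Nc, d), agents: (n, d) -> output (Nq, d).
    Hypothetical simplification: no learned Q/K/V projections, single head.
    """
    scale = 1.0 / np.sqrt(queries.shape[-1])
    # Step 1: each agent summarizes the context. (n, Nc) @ (Nc, d) -> (n, d)
    agent_summary = softmax(agents @ context.T * scale) @ context
    # Step 2: each query reads from the agents. (Nq, n) @ (n, d) -> (Nq, d)
    return softmax(queries @ agents.T * scale) @ agent_summary

def mutual_agent_cross_attention(enc, dec, agents):
    """Bidirectional exchange: decoder reads encoder and encoder reads decoder.

    The residual additions and shared agents are assumptions for illustration.
    """
    dec_out = dec + agent_cross_attention(dec, enc, agents)
    enc_out = enc + agent_cross_attention(enc, dec, agents)
    return enc_out, dec_out

rng = np.random.default_rng(0)
enc = rng.standard_normal((1024, 64))    # encoder tokens (e.g. a 32x32 feature map)
dec = rng.standard_normal((256, 64))     # decoder tokens (e.g. a 16x16 feature map)
agents = rng.standard_normal((16, 64))   # agent tokens (learnable in a real model)
enc_out, dec_out = mutual_agent_cross_attention(enc, dec, agents)
```

With these shapes, the largest attention map is 16 × 1024 instead of the 256 × 1024 map that direct decoder-to-encoder attention would need.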
