Multi-Level Aggregation and Recursive Alignment Architecture for Efficient Parallel Inference Segmentation Network
Yanhua Zhang, Ke Zhang, Jingyu Wang, Yulin Wu, Wuwei Wang
TL;DR
The paper tackles real-time semantic segmentation by proposing MFARANet, a single-pass network that achieves a favorable speed–accuracy balance through three core components: MFAM for multi-level feature aggregation, RAM for efficient recursive alignment of multi-scale feature maps, and ASFM for adaptive score-level fusion. Complemented by Multi-scale Joint Supervision (MJS), the method delivers robust multi-scale predictions at real-time inference speeds on standard benchmarks. Extensive ablations validate the contributions of MFAM, RAM, ASFM, and MJS, and cross-dataset experiments on Cityscapes, CamVid, and PASCAL-Context demonstrate generality. The results show that MFARANet attains competitive accuracy with significantly lower complexity than accuracy-oriented models, making it a practical option for real-world deployments.
Abstract
Real-time semantic segmentation is crucial for many real-world applications. However, many methods emphasize reducing computational complexity and model size at a large cost in accuracy. To tackle this problem, we propose a parallel inference network customized for semantic segmentation that achieves a good trade-off between speed and accuracy. We employ a shallow backbone to ensure real-time speed and propose three core components that improve accuracy by compensating for the reduced model capacity. Specifically, we first design a dual-pyramidal path architecture, the Multi-level Feature Aggregation Module (MFAM), which aggregates multi-level features from the encoder to each scale, providing hierarchical clues for subsequent spatial alignment and the corresponding in-network inference. Then, we build the Recursive Alignment Module (RAM) by combining a flow-based alignment module with a recursive upsampling architecture, which aligns multi-scale feature maps accurately at half the computational complexity of the straightforward alignment method. Finally, we perform independent parallel inference on the aligned features to obtain multi-scale scores, and adaptively fuse them through an attention-based Adaptive Scores Fusion Module (ASFM) so that the final prediction can favor objects of multiple scales. Our framework shows a better balance between speed and accuracy than state-of-the-art real-time methods on the Cityscapes and CamVid datasets. We also conduct systematic ablation studies to give insight into our motivation and architectural design. Code is available at: https://github.com/Yanhua-Zhang/MFARANet.
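The attention-based fusion of multi-scale scores described above can be illustrated with a minimal sketch. It assumes that fusion is a per-pixel softmax over scales followed by a weighted sum of the per-scale score maps; the abstract does not specify ASFM's internals, so this is a generic stand-in rather than the authors' implementation. The names `softmax` and `fuse_scores`, and the nested-list tensor layout, are hypothetical and chosen only for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(scores, attn_logits):
    """Fuse per-scale score maps with per-pixel attention weights.

    scores:      list over scales, each a [C][H][W] nested list of class scores
                 (assumed already aligned to a common resolution).
    attn_logits: list over scales, each an [H][W] nested list of attention logits
                 (assumed to come from some attention branch, not modeled here).
    Returns a fused [C][H][W] nested list.
    """
    num_scales = len(scores)
    num_classes = len(scores[0])
    height = len(scores[0][0])
    width = len(scores[0][0][0])

    fused = [[[0.0] * width for _ in range(height)] for _ in range(num_classes)]
    for y in range(height):
        for x in range(width):
            # Weights over scales sum to 1 at every pixel.
            weights = softmax([attn_logits[s][y][x] for s in range(num_scales)])
            for c in range(num_classes):
                fused[c][y][x] = sum(
                    weights[s] * scores[s][c][y][x] for s in range(num_scales)
                )
    return fused
```

With equal attention logits the fusion reduces to a plain average of the scales; a large logit at one scale makes the fused score at that pixel follow that scale, which is the sense in which the final prediction can favor objects of different sizes.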
