Multi-Level Aggregation and Recursive Alignment Architecture for Efficient Parallel Inference Segmentation Network

Yanhua Zhang, Ke Zhang, Jingyu Wang, Yulin Wu, Wuwei Wang

TL;DR

The paper tackles real-time semantic segmentation by proposing MFARANet, a single-pass network that balances speed and accuracy through three core components: MFAM for multi-level feature aggregation, RAM for efficient recursive alignment of multi-scale features, and ASFM for adaptive score-level fusion. Complemented by Multi-scale Joint Supervision (MJS), the method delivers robust multi-scale predictions at real-time inference speeds on standard benchmarks. Ablations examine the contributions of MFAM, RAM, ASFM, and MJS, and experiments on Cityscapes, CamVid, and PASCAL-Context indicate that the approach generalizes across datasets. The results show that MFARANet attains competitive accuracy with significantly lower complexity than accuracy-oriented models, making it a practical option for real-world deployments.

Abstract

Real-time semantic segmentation is a crucial research topic for real-world applications. However, many methods emphasize reducing computational complexity and model size at a large cost in accuracy. To tackle this problem, we propose a parallel inference network customized for semantic segmentation tasks to achieve a good trade-off between speed and accuracy. We employ a shallow backbone to ensure real-time speed, and propose three core components that compensate for the reduced model capacity and improve accuracy. Specifically, we first design a dual-pyramidal path architecture (Multi-level Feature Aggregation Module, MFAM) to aggregate multi-level features from the encoder to each scale, providing hierarchical clues for subsequent spatial alignment and the corresponding in-network inference. Then, we build the Recursive Alignment Module (RAM) by combining the flow-based alignment module with a recursive upsampling architecture, which yields accurate spatial alignment between multi-scale feature maps at half the computational complexity of the straightforward alignment method. Finally, we perform independent parallel inference on the aligned features to obtain multi-scale scores, and adaptively fuse them through an attention-based Adaptive Scores Fusion Module (ASFM) so that the final prediction can favor objects of multiple scales. Our framework shows a better balance between speed and accuracy than state-of-the-art real-time methods on the Cityscapes and CamVid datasets. We also conduct systematic ablation studies to gain insight into our motivation and architectural design. Code is available at: https://github.com/Yanhua-Zhang/MFARANet.
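
The score-level fusion step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes that the attention weights are normalized with a per-pixel softmax across scales, which the abstract does not specify, and the function name `fuse_scores` and the flattened list layout are hypothetical choices made only for this example.

```python
import math


def fuse_scores(scores, weights):
    """Fuse per-scale score maps using per-pixel attention over scales.

    scores:  list over scales; each entry is a list over pixels of
             per-class score lists (all scales already share one resolution).
    weights: list over scales; each entry is a list over pixels holding
             one unnormalized attention logit per pixel.

    Assumption (not stated in the source): logits are normalized with a
    softmax across scales before the weighted sum.
    """
    num_scales = len(scores)
    num_pixels = len(scores[0])
    num_classes = len(scores[0][0])

    fused = []
    for p in range(num_pixels):
        logits = [weights[s][p] for s in range(num_scales)]
        # Subtract the maximum for numerical stability before exponentiating.
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        attn = [e / total for e in exps]
        # Weighted sum of the per-scale class scores at this pixel.
        fused.append([
            sum(attn[s] * scores[s][p][c] for s in range(num_scales))
            for c in range(num_classes)
        ])
    return fused


# Two scales, one pixel, two classes; equal logits average the two scales.
example = fuse_scores([[[1.0, 0.0]], [[0.0, 1.0]]], [[0.0], [0.0]])
```

With equal logits every scale contributes equally, while a strongly dominant logit makes the fused score follow that scale almost exclusively, which is the behavior one would expect from a fusion that lets each pixel favor the scale best suited to the object it belongs to.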

Paper Structure (30 sections, 13 equations, 9 figures, 12 tables)

This paper contains 30 sections, 13 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: Inference speed and parameter count vs. mIoU accuracy on the Cityscapes test set. The size of each point indicates the model size. Purple points are the representative real-time methods listed in the table "Comparison of accuracy and speed on Cityscapes". The red and green points represent our MFARANet using 1024 $\times$ 1024 or the whole image (1024 $\times$ 2048) as input for inference, respectively. 'Pruned' indicates the special network pruning method used for our parallel inference network, which is shown in the figure "Pruning illustration" and discussed in the section "Ablation for Scale Selection". All FPS values are measured on a single RTX 3090 GPU at the image resolution on which inference is performed to calculate the accuracy.
  • Figure 2: Comparison of different parallel inference networks. The boxes represent feature maps, and their length and number roughly reflect their relative spatial resolution. (a) Independently passing the scaled images through the whole segmentation network (omitted for simplicity) to obtain multi-scale predictions. (b) The in-network parallel inference architecture proposed by Feature Pyramid Networks (FPN; Lin et al., 2017) for object detection tasks. (c) Our custom-designed parallel inference network for fine-grained real-time semantic segmentation tasks, which is much faster than the image pyramid in (a).
  • Figure 3: Overall architecture of our approach. (a) The details of MFAM. $S_i$, $D_i$ and $U_i$ represent multi-level features from different stages of ResNet-18, the bottom-up path and the top-down path, respectively. (b) The process of aligning low-resolution features with the highest-resolution feature $F_1$. The detailed structure of RAM is illustrated in the figure "Recursive Alignment". $F_i$ indicates the $i$-th scale feature obtained from MFAM, and $P_i$ denotes the corresponding aligned feature. (c) The architecture of ASFM. "Seg Head" and "Attention Head" represent the modules for obtaining the score map ($\mathrm{Score}_i$) and weight map ($\mathrm{Weight}_i$) of the $i$-th scale, which are shown in detail in the figure "Heads".
  • Figure 4: Comparison between Straightforward Alignment and Recursive Alignment. The red dashed box indicates the flow-based alignment module. $\mathrm{Up}$ represents bilinear interpolation. $\mathrm{Concat}$ denotes the channel concatenation operation. $f$ is the alignment function defined in the equation for the flow-based alignment module. (a) Straightforward Alignment Module. (b) Our proposed Recursive Alignment Module (RAM).
  • Figure 5: Long-range Skip Connection based Aggregation architecture (our initial design).
  • ...and 4 more figures
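
The warping idea behind the flow-based alignment mentioned in the Figure 4 caption can be sketched as follows. This is only an illustration under stated assumptions, not the paper's alignment function $f$: the flow field is taken as given rather than predicted from concatenated features, offsets are assumed to be expressed in low-resolution pixel units, an align-corners style grid mapping is assumed, and the names `bilinear_sample` and `warp_to_high_res` are hypothetical.

```python
def bilinear_sample(fmap, y, x):
    """Bilinearly sample a 2D single-channel map at a fractional (y, x)."""
    h, w = len(fmap), len(fmap[0])
    # Clamp the sampling location to the valid range of the map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def warp_to_high_res(low, flow, out_h, out_w):
    """Upsample `low` to (out_h, out_w), shifting each sample by `flow`.

    flow[i][j] is a (dy, dx) offset for output pixel (i, j), assumed to be
    in low-resolution pixel units. With an all-zero flow this reduces to
    plain bilinear upsampling.
    """
    h, w = len(low), len(low[0])
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Map the output pixel onto the low-resolution grid.
            src_y = i * (h - 1) / (out_h - 1) if out_h > 1 else 0.0
            src_x = j * (w - 1) / (out_w - 1) if out_w > 1 else 0.0
            dy, dx = flow[i][j]
            row.append(bilinear_sample(low, src_y + dy, src_x + dx))
        out.append(row)
    return out


low_res = [[0.0, 1.0], [2.0, 3.0]]
zero_flow = [[(0.0, 0.0)] * 3 for _ in range(3)]
upsampled = warp_to_high_res(low_res, zero_flow, 3, 3)
```

A non-zero offset moves the sampling location for that output pixel, which is how a learned flow field can correct the misalignment that plain bilinear interpolation leaves between feature maps of different scales.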