MetaSeg: MetaFormer-based Global Contexts-aware Network for Efficient Semantic Segmentation

Beoungwoo Kang, Seunghun Moon, Yubin Cho, Hyunwoo Yu, Suk-Ju Kang

TL;DR

This work addresses the need for efficient, accurate semantic segmentation by extending the MetaFormer paradigm from backbones to decoders. It introduces Channel Reduction Attention (CRA) within a Global Meta Block (GMB) to capture global contexts with reduced computation, enabling a CNN-based encoder to pair effectively with a transformer-style decoder. Across ADE20K, Cityscapes, COCO-Stuff, and Synapse, MetaSeg achieves state-of-the-art trade-offs, outperforming prior methods while reducing FLOPs and enabling faster inference. The approach offers strong cross-domain applicability and demonstrates that MetaFormer capacity can be effectively leveraged throughout the network, not just in the encoder.

Abstract

Beyond the Transformer, it is important to explore how to exploit the capacity of the MetaFormer, the architecture that underlies the Transformer's performance improvements. Previous studies have exploited it only for the backbone network. In contrast, we explore the capacity of the MetaFormer architecture more extensively for the semantic segmentation task. We propose a powerful semantic segmentation network, MetaSeg, which leverages the MetaFormer architecture from the backbone to the decoder. MetaSeg shows that the MetaFormer architecture plays a significant role in capturing useful contexts for the decoder as well as for the backbone. In addition, recent segmentation methods have shown that using a CNN-based backbone to extract spatial information and a decoder to extract global information is more effective than using a transformer-based backbone with a CNN-based decoder. This motivates us to adopt a CNN-based backbone built on the MetaFormer block and to design a MetaFormer-based decoder, which contains a novel self-attention module to capture global contexts. To balance global context extraction against the computational efficiency of self-attention for semantic segmentation, we propose a Channel Reduction Attention (CRA) module that reduces the channel dimension of the query and key to one dimension. As a result, MetaSeg outperforms previous state-of-the-art methods at lower computational cost on popular semantic segmentation benchmarks and a medical image segmentation benchmark, including ADE20K, Cityscapes, COCO-Stuff, and Synapse. The code is available at https://github.com/hyunwoo137/MetaSeg.
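
The channel reduction idea in the abstract can be illustrated with a minimal single-head NumPy sketch. It is not the authors' implementation (see the linked repository for that). The function name `channel_reduction_attention`, the weight names `w_q`, `w_k`, `w_v`, the random weights, and the choice to leave the value at full channel width are all assumptions made for illustration. The only property taken from the text is that the query and key are projected to a single channel, so each attention score is a product of two scalars rather than a dot product over all channels.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def channel_reduction_attention(x, w_q, w_k, w_v):
    """Hypothetical single-head sketch.

    x:   (N, C) token features
    w_q: (C, 1) query projection, reduces channels to 1 (assumed shape)
    w_k: (C, 1) key projection, reduces channels to 1 (assumed shape)
    w_v: (C, C) value projection, kept at full width (assumption)
    """
    q = x @ w_q                    # (N, 1)
    k = x @ w_k                    # (N, 1)
    v = x @ w_v                    # (N, C)
    scores = q @ k.T               # (N, N); costs N*N multiplies instead of N*N*C
    attn = softmax(scores, axis=-1)
    return attn @ v, attn          # output (N, C), attention map (N, N)

rng = np.random.default_rng(0)
N, C = 16, 8
x = rng.standard_normal((N, C))
w_q = rng.standard_normal((C, 1))
w_k = rng.standard_normal((C, 1))
w_v = rng.standard_normal((C, C))
out, attn = channel_reduction_attention(x, w_q, w_k, w_v)
print(out.shape, attn.shape)
```

With a one-dimensional query and key, the usual scaling by the square root of the key dimension equals 1, so the sketch omits it. Any further details of the actual module, such as multiple heads or spatial reduction of keys and values, are not stated in this section and are left out.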

Paper Structure (21 sections, 5 equations, 11 figures, 10 tables)

This paper contains 21 sections, 5 equations, 11 figures, 10 tables.

Figures (11)

  • Figure 1: Performance-computation curves on the ADE20K validation set. We compare the performance and computation of our MetaSeg with recent models (guo2022segnext, xie2021segformer, shim2023feedformer, cheng2021per). We find that our MetaSeg has the best trade-off between performance and computational cost.
  • Figure 2: (a) Overall architecture of MetaSeg, consisting of two main parts: a hierarchical CNN-based encoder and a Global Meta Block (GMB) based decoder. (b) Details of the GMB, which is composed of the proposed Channel Reduction Attention (CRA) module and the channel MLP. Our MetaSeg extracts multi-scale features that contain local information in the encoder and complements them with global information in the GMB of the decoder.
  • Figure 3: Illustration of the proposed Channel Reduction Attention (CRA). In our CRA, the channel dimension of the query and key is reduced to one dimension for computational efficiency, while the module still captures the globality of the features effectively.
  • Figure 4: Visualization of our prediction maps and our attention score maps on ADE20K.
  • Figure 5: Qualitative results on the ADE20K dataset. Compared to SegNeXt guo2022segnext, our MetaSeg produces more detailed predictions for various categories.
  • ...and 6 more figures
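
The caption of Figure 2(b) describes the GMB as a CRA module followed by a channel MLP. As a rough sketch of how such a block could be wired, the code below follows the general MetaFormer template: normalize, apply a token mixer, add a residual, then normalize, apply a channel MLP, and add a residual. The normalization type, ReLU activation, hidden width, and all names (`global_meta_block`, `channel_mlp`, `mean_mixer`) are assumptions for illustration. The token mixer here is a simple placeholder standing in for CRA, not the CRA module itself.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token over its channel dimension (assumed norm choice).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def channel_mlp(x, w1, w2):
    # Two-layer per-token MLP with ReLU (activation is an assumption).
    return np.maximum(x @ w1, 0.0) @ w2

def global_meta_block(x, token_mixer, w1, w2):
    """MetaFormer-style block: token mixing, then channel MLP, each with a residual."""
    x = x + token_mixer(layer_norm(x))          # global context branch (CRA in the paper)
    x = x + channel_mlp(layer_norm(x), w1, w2)  # channel mixing branch
    return x

def mean_mixer(tokens):
    # Placeholder mixer: every token receives the mean of all tokens.
    return np.broadcast_to(tokens.mean(axis=0, keepdims=True), tokens.shape)

rng = np.random.default_rng(0)
N, C, H = 16, 8, 32                  # tokens, channels, hidden width (all arbitrary)
x = rng.standard_normal((N, C))
w1 = rng.standard_normal((C, H)) * 0.1
w2 = rng.standard_normal((H, C)) * 0.1
y = global_meta_block(x, mean_mixer, w1, w2)
print(y.shape)
```

Passing the mixer as a callable keeps the block structure separate from the choice of attention, which mirrors the MetaFormer view that the overall template matters independently of the specific token mixer.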