A Decoding Scheme with Successive Aggregation of Multi-Level Features for Light-Weight Semantic Segmentation

Jiwon Yoo, Jangwon Lee, Gyeonghwan Kim

TL;DR

The paper tackles the high computational cost of transformer-based semantic segmentation on high-resolution imagery by introducing SASFormer, a light-weight decoder that exploits multi-level features through two modules: the Accumulated Semantics Extractor (ASE) and the Semantic Combining Module (SCM). ASE applies successive cross-attention to downsampled multi-scale features to extract aggregated semantics, maintaining contextual consistency while reducing cost. SCM then uses these aggregated semantics as weights to refine the multi-scale features before final segmentation, yielding a favorable accuracy-cost trade-off. Experiments on ADE20K and Cityscapes show state-of-the-art efficiency with competitive accuracy, and extensive ablations support the successive cross-attention and SCM designs. The decoder can also be attached to other hierarchical vision transformer (HVT)-based models.

Abstract

Multi-scale architectures, including hierarchical vision transformers, have been commonly applied to high-resolution semantic segmentation to deal with computational complexity with minimum performance loss. In this paper, we propose a novel decoding scheme for semantic segmentation in this regard, which takes multi-level features from an encoder with a multi-scale architecture. The decoding scheme, based on a multi-level vision transformer, aims to achieve not only reduced computational expense but also higher segmentation accuracy by introducing successive cross-attention in the aggregation of the multi-level features. Furthermore, a way to enhance the multi-level features with the aggregated semantics is proposed. The effort is focused on maintaining contextual consistency from the perspective of attention allocation and brings improved performance with significantly lower computational cost. A set of experiments on popular datasets demonstrates the superiority of the proposed scheme over state-of-the-art semantic segmentation models in terms of computational cost without loss of accuracy, and extensive ablation studies prove the effectiveness of the proposed ideas.
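
The idea of successive cross-attention followed by semantic reweighting can be illustrated with a toy sketch. This is not the authors' implementation: the summary above does not specify which side supplies the query, whether residual connections are used, or how the aggregated semantics become weights. The sketch assumes that the running semantics act as the query, that each shallower level supplies keys and values, that all levels already share one channel width, and that a sigmoid gate (a hypothetical choice) turns the semantics into per-channel weights. Learned projections are omitted, and the names `cross_attention`, `successive_aggregation`, and `combine` are illustrative only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query token attends over all keys."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def successive_aggregation(levels):
    """Fold multi-level features into one set of semantics, deepest level first.

    Assumption: the running semantics are the query; each shallower level
    provides keys and values. A residual sum keeps earlier context.
    """
    semantics = levels[-1]
    for feats in reversed(levels[:-1]):
        attended = cross_attention(semantics, feats, feats)
        semantics = [[s + a for s, a in zip(s_tok, a_tok)]
                     for s_tok, a_tok in zip(semantics, attended)]
    return semantics


def combine(level_feats, semantics):
    """Reweight one level's tokens with a per-channel gate (hypothetical design)."""
    pooled = [sum(col) / len(semantics) for col in zip(*semantics)]
    gate = [1.0 / (1.0 + math.exp(-p)) for p in pooled]
    return [[f * g for f, g in zip(tok, gate)] for tok in level_feats]


# Two toy levels with 2 channels each: 4 shallow tokens and 2 deep tokens.
levels = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],
    [[1.0, 0.0], [0.0, 1.0]],
]
semantics = successive_aggregation(levels)
refined = combine(levels[0], semantics)
```

In this sketch the number of semantic tokens stays equal to the token count of the deepest level, so each attention step costs on the order of (semantic tokens × level tokens) rather than the square of the largest level's token count. That is the kind of saving the abstract attributes to aggregating downsampled features, though the actual cost figures depend on details not given here.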

Paper Structure

This paper contains 14 sections, 6 equations, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: The structure of a semantic segmentation scheme that employs an HVT-based encoder. The performance relies on how effectively the decoder fuses the features of different layers.
  • Figure 2: The overall structure of the HVT-based semantic segmentation where the proposed SASFormer is used as the decoder. Q, K, and V represent query, key, and value, respectively. The figure illustrates how the successive aggregation of the multi-scale features is performed and the aggregated semantics are employed in the proposed ASE and SCM, respectively.
  • Figure 3: Two attention configurations can be adopted in ASE for the ablation study of the successive cross-attention: (a) self-attention, and (b) cross-attention.
  • Figure 4: Qualitative comparison on an image from ADE20K. The results of three HVT-based models are compared to those obtained when each model's decoder is replaced with SASFormer. For all three, the models with the proposed scheme produce more precise segmentation in the boxed areas where multi-scale objects are present.