MSDNet: Multi-Scale Decoder for Few-Shot Semantic Segmentation via Transformer-Guided Prototyping

Amirreza Fateh, Mohammad Reza Mohammadi, Mohammad Reza Jahed Motlagh

TL;DR

MSDNet tackles few-shot semantic segmentation by combining a lightweight transformer-based prototyping framework with multi-scale decode-and-refine capabilities. It introduces a Spatial Transformer Decoder (STD) that uses the support prototype as Query and the query features as Key/Value, along with a Contextual Mask Generation Module (CMGM) to provide pixel-wise relational priors, and a hierarchical Multi-Scale Decoder to refine masks across resolutions. The backbone remains shared and efficient, merging mid-level encoder features to enrich context, and the approach achieves competitive results on PASCAL-5^i and COCO-20^i with only 1.5M parameters. Overall, MSDNet demonstrates a strong performance-efficiency trade-off and offers a practical pathway for robust cross-dataset generalization in few-shot segmentation.
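
The attention direction described above (support prototype as Query, query-image features as Key/Value) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: masked average pooling is assumed here as one common way to build a prototype, the learned projection layers and multi-head structure of a real transformer decoder are omitted, and the function names `masked_average_pooling` and `prototype_cross_attention` are hypothetical.

```python
import numpy as np


def masked_average_pooling(support_feats, support_mask):
    """Average support features over the foreground mask (assumed prototype builder).

    support_feats: (C, H, W) feature map; support_mask: (H, W) binary mask.
    Returns a (C,) prototype vector.
    """
    weights = support_mask.astype(support_feats.dtype)
    total = weights.sum()
    if total == 0:
        raise ValueError("support mask is empty")
    # Broadcast the (H, W) mask over channels, then average over space.
    return (support_feats * weights).sum(axis=(1, 2)) / total


def prototype_cross_attention(prototype, query_feats):
    """Single-head cross-attention with the prototype as the only Query token.

    prototype: (C,); query_feats: (C, H, W), used as both Key and Value.
    Returns the attended vector (C,) and the attention map (H, W).
    """
    C, H, W = query_feats.shape
    keys = query_feats.reshape(C, H * W).T      # (HW, C)
    scores = keys @ prototype / np.sqrt(C)      # scaled dot product, (HW,)
    scores -= scores.max()                      # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum()                          # softmax over query positions
    attended = attn @ keys                      # weighted sum of Values, (C,)
    return attended, attn.reshape(H, W)
```

In this sketch the attention map indicates which query-image positions most resemble the support prototype; how MSDNet turns the decoder output into a mask is not specified in this summary.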

Abstract

Few-shot Semantic Segmentation addresses the challenge of segmenting objects in query images with only a handful of annotated examples. However, many previous state-of-the-art methods either have to discard intricate local semantic features or suffer from high computational complexity. To address these challenges, we propose a new Few-shot Semantic Segmentation framework based on the Transformer architecture. Our approach introduces the spatial transformer decoder and the contextual mask generation module to improve the relational understanding between support and query images. Moreover, we introduce a multi-scale decoder to refine the segmentation mask by incorporating features from different resolutions in a hierarchical manner. Additionally, our approach integrates global features from intermediate encoder stages to improve contextual understanding, while maintaining a lightweight structure to reduce complexity. This balance between performance and efficiency enables our method to achieve competitive results on benchmark datasets such as PASCAL-5^i and COCO-20^i in both 1-shot and 5-shot settings. Notably, our model with only 1.5 million parameters demonstrates competitive performance while overcoming limitations of existing methodologies. Code: https://github.com/amirrezafateh/MSDNet

Paper Structure (21 sections, 5 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Comparison among existing methods and our proposed method for FSS. (a) Prototype-based methods; (b) Pixel-wise methods; (c) The proposed method builds upon prototype-based strategies while enhancing contextual understanding and segmentation quality through Transformer-guided prototyping and multi-scale decoding.
  • Figure 2: Overview of the proposed method.
  • Figure 3: Spatial Transformer Decoder
  • Figure 4: Qualitative comparison of component effects in the 1-shot scenario for (a) COCO-20^i and (b) PASCAL-5^i datasets.
  • Figure 5: Overview of the Multi-Scale Decoder with different numbers of residual blocks in each stage (1-4).