Unbiased Semantic Decoding with Vision Foundation Models for Few-shot Segmentation

Jin Wang, Bingfeng Zhang, Jian Pang, Weifeng Liu, Baodi Liu, Honglong Chen

TL;DR

The paper tackles the bias toward base classes in SAM-based few-shot segmentation by introducing Unbiased Semantic Decoding (USD), which jointly leverages CLIP semantics and SAM decoding. USD employs a Global Supplement Module (GSM) for image-level semantic enrichment, a Local Guidance Module (LGM) for pixel-level localization, and a Visual-Text Target Prompt Generator (VTPG) to create target-focused prompts, all without retraining the foundation models. The approach yields state-of-the-art results on PASCAL-5i and COCO-20i, demonstrates robust performance under domain shift and low-resource settings, and reduces reliance on handcrafted prompts by fusing multi-modal predictions. Overall, USD significantly improves generalization to novel classes in few-shot segmentation and offers practical benefits for open-world segmentation tasks.

Abstract

Few-shot segmentation has garnered significant attention, and many recent approaches introduce the Segment Anything Model (SAM) to handle this task. With its strong generalization ability and rich object-specific extraction ability, SAM shows great potential for few-shot segmentation. However, SAM's decoding process relies heavily on accurate and explicit prompts, so previous approaches focus mainly on extracting prompts from the support set. This is insufficient to activate the generalization ability of SAM and easily leads to a biased decoding process when adapting to unknown classes. In this work, we propose an Unbiased Semantic Decoding (USD) strategy integrated with SAM, which extracts target information from both the support and query sets simultaneously to perform consistent predictions guided by the semantics of the Contrastive Language-Image Pre-training (CLIP) model. Specifically, to enhance the unbiased semantic discrimination of SAM, we design two feature enhancement strategies that leverage the semantic alignment capability of CLIP to enrich the original SAM features: a global supplement at the image level, which provides a generalized category indication from the support image, and a local guidance at the pixel level, which provides a useful target location from the query image. In addition, to generate target-focused prompt embeddings, we propose a learnable visual-text target prompt generator in which target text embeddings interact with CLIP visual features. Without requiring re-training of the vision foundation models, the semantically discriminative features draw attention to the target region under the guidance of prompts rich in target information.
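To make the two feature enhancement strategies more concrete, the following is a minimal NumPy sketch of how an image-level supplement and a pixel-level guidance map could enrich SAM features with CLIP features. It is an illustration under stated assumptions, not the authors' implementation: the function names, the linear projection `proj`, the additive fusion, and the `1 + map` reweighting are all hypothetical choices, and the abstract does not specify the actual operations.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def global_supplement(sam_feats, clip_global, proj):
    """Image-level enrichment (assumed form, not from the paper).

    sam_feats:   (H, W, C_sam) SAM image features
    clip_global: (C_clip,) CLIP image embedding of the support image
    proj:        (C_clip, C_sam) hypothetical learned projection into SAM space
    """
    # The projected CLIP vector is broadcast to every spatial location.
    return sam_feats + clip_global @ proj


def local_guidance(sam_feats, clip_patches, text_emb):
    """Pixel-level enrichment (assumed form, not from the paper).

    clip_patches: (H, W, C_clip) CLIP patch features of the query image
    text_emb:     (C_clip,) CLIP text embedding of the target class
    Returns the reweighted SAM features and the location map in [0, 1].
    """
    # Cosine similarity between every patch and the class text embedding.
    sim = l2_normalize(clip_patches) @ l2_normalize(text_emb)  # (H, W)
    # Min-max scaling turns the similarities into a soft location map.
    loc = (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)
    # Emphasize SAM features where the target is likely to be.
    return sam_feats * (1.0 + loc[..., None]), loc
```

In this sketch both functions leave the SAM and CLIP backbones untouched; only `proj` would be learnable, which is consistent with the abstract's statement that the foundation models are not re-trained.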

Paper Structure

This paper contains 32 sections, 19 equations, 11 figures, 12 tables, 1 algorithm.

Figures (11)

  • Figure 1: Comparison of different FSS methods. (a) Existing prototype-level or pixel-level matching FSS methods, which obtain predictions by designing fine-grained decoders. (b) The framework of existing SAM-based FSS methods, which primarily aims at extracting visual features from the support image using ImageNet pre-trained CNN weights for prompt generation. (c) The framework of existing CLIP-based FSS methods, which primarily generate coarse maps by visual-visual matching or visual-text matching. (d) Our proposed strategy integrates visual-text prompt generation from both support and query images with a frozen decoder, and enhances the semantic information of SAM features by complementing them with CLIP features.
  • Figure 2: Overview of the proposed Unbiased Semantic Decoding method under the 1-shot setting. We design two feature enhancement strategies, image-level GSM and pixel-level LGM, to enhance the semantic information of SAM features using semantically rich CLIP features. The GSM enhances SAM features with image-level semantic information by mapping CLIP visual features to SAM space. The LGM enriches SAM features by mining semantic correlations between pixels to obtain pixel-level category representations. Finally, without any re-training of the SAM and CLIP models, VTPG is proposed to generate a multi-modal target to further activate the target regions.
  • Figure 3: Changes in mIoU under different experimental settings during training on the PASCAL-5$^{i}$ dataset. USD reaches its optimal performance and stabilizes before 50 epochs, faster than the previous method that required 100 epochs or more.
  • Figure 4: Changes in loss under different experimental settings during training on the PASCAL-5$^{i}$ dataset.
  • Figure 5: Qualitative results of the proposed USD and the VRP-SAM approach under the 1-shot setting on both the PASCAL-5$^{i}$ and COCO-20$^{i}$ datasets. From top to bottom, the rows show the support images with ground-truth (GT) masks (blue), query images with GT masks (red), VRP-SAM results (purple), and our results (green), respectively.
  • ...and 6 more figures