Beyond Visual Cues: Leveraging General Semantics as Support for Few-Shot Segmentation

Jin Wang, Bingfeng Zhang, Jian Pang, Mengyu Liu, Honglong Chen, Weifeng Liu

TL;DR

This paper addresses the reliance of few-shot segmentation (FSS) on support images by introducing Language-Driven Attribute Generalization (LDAG). LDAG uses Large Language Models (LLMs) to generate multiple attribute descriptions of the target class and aligns them with visual features through the cross-modal Multi-modal Attribute Alignment (MaA) module, providing unbiased, generalizable guidance for both trained and untrained classes. The Multi-attribute Enhancement (MaE) module provides pixel-level multi-attribute priors, while MaA reinforces visual representations via contrastive text–visual alignment; both are integrated with CLIP and SAM for decoding. Experiments on PASCAL-5$^{i}$ and COCO-20$^{i}$ report state-of-the-art results, and ablations indicate that text-based attributes, rather than support images, drive the performance and generalization gains.

Abstract

Few-shot segmentation (FSS) aims to segment novel classes under the guidance of limited support samples through a meta-learning paradigm. Existing methods mainly mine references from support images as meta guidance. However, due to intra-class variations among visual representations, the meta information extracted from support images cannot produce accurate guidance for segmenting untrained classes. In this paper, we argue that references from support images may not be essential: the key role of the support is to provide unbiased meta guidance for both trained and untrained classes. We then introduce a Language-Driven Attribute Generalization (LDAG) architecture that uses language descriptions of inherent target properties to build a robust support strategy. Specifically, to obtain an unbiased support representation, we design a Multi-attribute Enhancement (MaE) module, which produces multiple detailed attribute descriptions of the target class through Large Language Models (LLMs) and then builds refined visual-text prior guidance using multi-modal matching. Meanwhile, because the text-vision modal shift makes it hard for attribute text to improve visual feature representations, we design a Multi-modal Attribute Alignment (MaA) module to achieve cross-modal interaction between attribute texts and visual features. Experiments show that our proposed method outperforms existing approaches by a clear margin and achieves new state-of-the-art performance. The code will be released.
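The multi-attribute prior described in the abstract can be illustrated with a minimal sketch. It assumes that pixel features and attribute text embeddings already live in a shared embedding space (for example, from a CLIP-like encoder) and uses plain cosine similarity as a stand-in for the matching step; the paper itself reports using softmax-GradCAM, which is not reproduced here. The function names `cosine` and `multi_attribute_prior`, the min-max normalisation, and the mean fusion are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def multi_attribute_prior(pixel_feats, attr_embeds):
    """Build one prior map per attribute plus a fused map (hypothetical helper).

    pixel_feats: H x W x D nested lists of per-pixel visual features.
    attr_embeds: K x D nested lists, one text embedding per attribute description.
    Returns (maps, fused): K min-max-normalised H x W maps and their mean.
    """
    maps = []
    for text_vec in attr_embeds:
        sim = [[cosine(pixel, text_vec) for pixel in row] for row in pixel_feats]
        lo = min(min(row) for row in sim)
        hi = max(max(row) for row in sim)
        scale = hi - lo if hi > lo else 1.0  # avoid division by zero on flat maps
        maps.append([[(s - lo) / scale for s in row] for row in sim])

    height, width = len(pixel_feats), len(pixel_feats[0])
    # Assumption: attributes are fused by a simple mean over the K maps.
    fused = [
        [sum(m[i][j] for m in maps) / len(maps) for j in range(width)]
        for i in range(height)
    ]
    return maps, fused
```

Because each attribute yields its own map, different descriptions can highlight different regions of the target, and the fused map covers more of the object than a single fixed text template would.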

Paper Structure

This paper contains 18 sections, 10 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Comparison of prior masks and performance on some common classes using different FSS methods. (a) Support images with ground-truth masks (yellow); (b) Query images with ground-truth masks (blue); (c) Prior information from the previous visual-visual matching SOTA method [sun2024vrp], which struggles to capture the target class; (d) Prior information from the previous visual-text alignment SOTA method [wang2024rethinking] with the fixed text template "a photo of {target class}", which focuses only on parts of the target class, e.g., the mirror of a bus or the head of a cat; (e) Our proposed multiple visual-text prior information, which focuses on the overall area of the target class rather than on locally distinctive areas.
  • Figure 2: Overview of LDAG. MaE leverages LLMs to generate attribute text through Q&A reasoning; multiple attribute priors are then generated using softmax-GradCAM. MaA promotes visual attribute representations via text-vision contrastive learning. Notably, competitive performance is preserved when MaA is removed, i.e., when support images are deleted, demonstrating LDAG's inherent robustness.
  • Figure 3: Qualitative results of the proposed LDAG and another SAM-based method (VRP-SAM [sun2024vrp]) under the 1-shot setting. From top to bottom, the rows show the support images with ground-truth (GT) masks (blue), query images with GT masks (red), baseline results (purple), and our results (green).
  • Figure 4: Visualization of prior information on PASCAL-5$^{i}$ [shaban2017one] and COCO-20$^{i}$ [nguyen2019feature]. "Image" denotes the query image, "Prior-E" denotes the prior information generated by fixed text descriptions, and "Prior-O" denotes our proposed prior information. Our method focuses on different regions of the target class through different attribute information, which provides richer reference information and reduces the dependence of the FSS task on support images.
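
The Figure 2 caption states that MaA relies on text-vision contrastive learning but does not give the loss. As a rough illustration only, the sketch below uses an InfoNCE-style objective, a common choice for such alignment; whether the paper uses this exact form is an assumption. Here `visual[i]` and `text[i]` are treated as a matching pair for attribute `i`, all other texts act as negatives, and the names `l2_normalize` and `info_nce` as well as the temperature value are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(visual, text, temperature=0.07):
    """Mean visual-to-text InfoNCE loss over K paired embeddings.

    visual, text: K x D nested lists; row i of each forms the positive pair.
    Assumption: a single visual-to-text direction; a symmetric variant would
    average this with the text-to-visual direction.
    """
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in text]
    total = 0.0
    for i, v_i in enumerate(v):
        logits = [sum(a * b for a, b in zip(v_i, t_j)) / temperature for t_j in t]
        # Numerically stable log-sum-exp over all candidate texts.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]  # cross-entropy with the matching text as target
    return total / len(v)
```

The loss is small when each visual attribute feature is closest to its own text embedding and grows when the pairs are mismatched, which is the behaviour a cross-modal alignment objective needs.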