Prompt-and-Transfer: Dynamic Class-aware Enhancement for Few-shot Segmentation

Hanbo Bi, Yingchao Feng, Wenhui Diao, Peijin Wang, Yongqiang Mao, Kun Fu, Hongqi Wang, Xian Sun

TL;DR

This work tackles the limitation of fixed, class-agnostic encoders in few-shot segmentation by introducing PAT, a prompt-driven framework that dynamically tunes the encoder to the target class in a given task. PAT leverages cross-modal language initialization (via CLIP), Semantic Prompt Transfer (SPT) with Gaussian suppression, and Part Mask Generator (PMG) to produce diverse, region-specific prompts that steer the encoder toward task-relevant objects. Through iterative prompting and transferring across the encoder, PAT achieves state-of-the-art results on standard FSS benchmarks and demonstrates strong cross-domain, weak-label, and zero-shot performance, underscoring its versatility and practical impact. The approach shifts emphasis from decoder-centric improvements to adaptive, class-aware encoding, offering a scalable path for robust generalization in flexible segmentation scenarios.

Abstract

For more efficient generalization to unseen domains (classes), most Few-shot Segmentation (FSS) methods directly exploit pre-trained encoders and fine-tune only the decoder, especially in the current era of large models. However, such fixed feature encoders tend to be class-agnostic, inevitably activating objects that are irrelevant to the target class. In contrast, humans can effortlessly focus on specific objects in the line of sight. This paper mimics the visual perception pattern of human beings and proposes a novel and powerful prompt-driven scheme, called "Prompt and Transfer" (PAT), which constructs a dynamic class-aware prompting paradigm to tune the encoder to focus on the object of interest (target class) in the current task. Three key points are elaborated to enhance the prompting: 1) Cross-modal linguistic information is introduced to initialize prompts for each task. 2) Semantic Prompt Transfer (SPT) precisely transfers the class-specific semantics within the images to prompts. 3) Part Mask Generator (PMG) works in conjunction with SPT to adaptively generate different but complementary part prompts for different individuals. Surprisingly, PAT achieves competitive performance on 4 different tasks, including standard FSS, Cross-domain FSS (e.g., CV, medical, and remote sensing domains), Weak-label FSS, and Zero-shot Segmentation, setting new state-of-the-art results on 11 benchmarks.

Paper Structure

This paper contains 34 sections, 14 equations, 16 figures, 14 tables, 2 algorithms.

Figures (16)

  • Figure 1: (a) Comparison of our PAT and previous work. For more efficient generalization, most FSS methods prefer to directly employ pre-trained encoders and fine-tune only the decoder. However, such frozen feature encoders tend to be class-agnostic, inevitably activating other classes irrelevant to the current FSS task (the 2nd row), because their semantic clues are derived from pre-training on image classification. (b) Human visual perception pattern. When processing visual stimuli from the retina, the cerebral cortex simultaneously parses other stimuli in the current state (mental, auditory, etc.), selectively focusing on specific objects in view while leaving the rest in the shadow of consciousness. Inspired by this visual perception pattern, our "Prompt and Transfer" (PAT) method instead dynamically drives the encoder to focus on specific objects in a class-aware prompting manner (the 3rd row in (a)).
  • Figure 2: Overview of various Few-shot Segmentation (FSS) structures. (a) Prototype Matching-based methods [zhang2020sg, wang2019panet, liu2020part, fan2022self]. (b) Feature Fusion-based methods [chen2021apanet, min2021hypercorrelation, cheng2022holistic, liu2022learning]. (c) Pixel Matching-based methods [zhang2021few, wang2022adaptive, xu2023self, peng2023hierarchical]. (d) Our "Prompt and Transfer" (PAT) method, which dynamically generates part-level semantic prompts (i.e., part prompts) to tune the encoder for activating class-specific objects in the query image.
  • Figure 3: PAT performs strongly on 12 benchmarks across 4 different tasks, achieving SOTA on 11 of them. All results are obtained with the DeiT-B/16 backbone under the 1-shot setting (except for Zero-shot Segmentation).
  • Figure 4: Overview structure of the proposed PAT. PAT first derives the image tokens through the Embedding layer. Then a pre-trained text encoder is introduced to mine representative textual semantics, which, along with randomly initialized embeddings, are used as the initial prompts to interact with image features. To better enhance the dynamic class-aware prompting, Prompt Enhancement is introduced to adaptively transfer the target semantics within a specific region (e.g., fine-grained local regions) from the support/query image to prompts via the Semantic Prompt Transfer and Part Mask Generator. These prompts in turn interact with image features in the next encoder block to activate specific objects within the features. After several alternations of prompting and transferring in the encoder blocks, the derived prompts are directly used to compute similarity with the class-aware query feature, producing the segmentation results in the Matching Head. Notably, we only describe the encoder blocks that perform the Prompt Enhancement, while omitting others.
  • Figure 5: The specific process of the Part Mask Generator (PMG), which aims at adaptively generating a series of different part-level masks for different individuals.
  • ...and 11 more figures
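
The Figure 4 caption describes two steps that a toy example can make concrete: transferring semantics from a masked image region into a prompt, and matching prompts against query features by similarity. The following is only a minimal sketch under stated assumptions, not the paper's implementation. Masked average pooling with a fixed blend stands in for Semantic Prompt Transfer, and thresholded cosine similarity stands in for the Matching Head. The function names (`transfer_prompt`, `match`) and the parameters `momentum` and `threshold` are hypothetical and do not come from the source.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def transfer_prompt(prompt, tokens, mask, momentum=0.5):
    """Blend a prompt with the mean of the tokens selected by a binary mask.

    Assumption: masked average pooling plus a linear blend is a stand-in
    for the paper's Semantic Prompt Transfer, whose details are not given here.
    """
    selected = [t for t, m in zip(tokens, mask) if m]
    if not selected:
        return list(prompt)  # nothing to transfer from an empty region
    dim = len(prompt)
    pooled = [sum(t[d] for t in selected) / len(selected) for d in range(dim)]
    return [momentum * p + (1 - momentum) * q for p, q in zip(prompt, pooled)]


def match(prompts, query_tokens, threshold=0.5):
    """Mark a query token as foreground if its best prompt similarity exceeds threshold."""
    scores = [max(cosine(p, t) for p in prompts) for t in query_tokens]
    return [1 if s > threshold else 0 for s in scores]


# Toy data: three 2-D support tokens, the first two belong to the target region.
support_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
support_mask = [1, 1, 0]
prompt = transfer_prompt([0.5, 0.5], support_tokens, support_mask)

# Query tokens 0 and 2 point in the target direction, token 1 does not.
query_tokens = [[1.0, 0.1], [0.1, 1.0], [0.8, 0.0]]
prediction = match([prompt], query_tokens, threshold=0.9)
print(prediction)
```

In this toy setting the prompt is pulled toward the masked support region, so only query tokens aligned with that region pass the threshold. In the method as described by the caption, this prompting and transferring alternates across several encoder blocks, and multiple part prompts would be passed to the matching step rather than a single one.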