No Re-Train, More Gain: Upgrading Backbones with Diffusion model for Pixel-Wise and Weakly-Supervised Few-Shot Segmentation

Shuai Chen, Fanman Meng, Chenhao Wu, Haoran Wei, Runtong Zhang, Qingbo Wu, Linfeng Xu, Hongliang Li

TL;DR

DiffUp addresses three practical gaps in few-shot segmentation (FSS) by reframing the task as a diffusion process conditioned on backbone-agnostic priors derived from diverse annotations. It introduces three components: the Backbone-Agnostic Feature Transform (BAFT), which harmonizes features across backbones; Uncertainty-Aware Prior Fusion (UAPF), which fuses information from varying shot counts while accounting for uncertainty; and a quality-aware diffusion decoder (UQDD), which guides diffusion over multi-granularity priors through a Self-Conditioned Modulation (SCM) block and a Dual-level Quality Modulation (DQM) branch. The approach enables backbone upgrades without re-training, supports multiple annotation forms, and handles zero-shot to many-shot scenarios within a single model. Empirically, DiffUp achieves state-of-the-art results on Pascal-5i and COCO-20i, demonstrates strong cross-dataset transfer, and remains competitive with reduced annotation and training constraints, offering a practical solution for flexible FSS deployment.

Abstract

Few-Shot Segmentation (FSS) aims to segment novel classes using only a few annotated images. Despite considerable progress under pixel-wise support annotation, current FSS methods still face three issues: the inability to upgrade the backbone without re-training, the inability to uniformly handle various types of annotations (e.g., scribble, bounding box, mask, and text), and the difficulty of accommodating varying annotation quantities. To address these issues simultaneously, we propose DiffUp, a novel framework that conceptualizes the FSS task as a conditional generative problem using a diffusion process. For the first issue, we introduce a backbone-agnostic feature transformation module that converts different segmentation cues into unified coarse priors, facilitating seamless backbone upgrades without re-training. For the second issue, because the transformed priors from diverse annotation types (scribble, bounding box, mask, and text) vary in granularity, we treat these multi-granular priors as analogous to the noisy intermediates at different steps of a diffusion model. This is implemented via a self-conditioned modulation block coupled with a dual-level quality modulation branch. For the third issue, we incorporate an uncertainty-aware information fusion module to harmonize the variability across zero-shot, one-shot, and many-shot scenarios. Evaluated on rigorous benchmarks, DiffUp significantly outperforms existing FSS models in terms of flexibility and accuracy.
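
The analogy between coarse priors and noisy diffusion intermediates can be made concrete with the standard DDPM forward process, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The sketch below is only an illustration under stated assumptions: the linear beta schedule, the 1000-step horizon, the {-1, +1} mask scaling, and the helper names `alpha_bar`, `noise_mask`, and `signal_to_noise` are hypothetical choices, not details taken from the paper.

```python
import math
import random

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) over the first t steps.

    Assumes a linear beta schedule (an illustrative choice, not from the paper).
    """
    prod = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def noise_mask(mask, t, rng, num_steps=1000):
    """Sample x_t ~ q(x_t | x_0) for a binary mask rescaled to {-1, +1}."""
    a = alpha_bar(t, num_steps)
    return [
        [math.sqrt(a) * (2.0 * p - 1.0) + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
         for p in row]
        for row in mask
    ]

def signal_to_noise(t, num_steps=1000):
    """Signal-to-noise ratio of x_t; valid for t >= 1."""
    a = alpha_bar(t, num_steps)
    return a / (1.0 - a)

if __name__ == "__main__":
    rng = random.Random(0)
    mask = [[0, 1, 1], [0, 1, 0]]
    # A small t keeps most of the mask (fine-grained cue);
    # a large t leaves mostly noise (coarse cue).
    for t in (1, 300, 900):
        print(t, round(signal_to_noise(t), 4), noise_mask(mask, t, rng)[0])
```

Under this reading, a precise cue such as a pixel-level mask would sit near a small t (high signal-to-noise ratio), while a coarse cue such as a bounding box or text prompt would sit near a larger t. How DiffUp actually assigns priors to steps is not specified in this summary.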

Paper Structure (34 sections, 16 equations, 9 figures, 10 tables)

Figures (9)

  • Figure 1: Comparison of approaches: (a) Traditional FSS methods use discriminative formulations, limiting backbone upgrades without retraining and struggling with diverse annotation types/quantities. (b) Our DiffUp adopts a generative approach, treating varying segmentation cues as diffusion process intermediates, enabling flexible backbone upgrades and supporting multiple annotation forms (scribbles, bounding boxes, masks, text) in various quantities.
  • Figure 2: Overview of DiffUp. (a) Training phase: Multi-scale features from base classes are extracted via a frozen RN50 backbone under various annotations (scribble, bounding-box, pixel-level, textual). Features undergo BAFT projection to a universal space (see the BAFT subsection), enabling seamless backbone upgrades. UAPF refines features by fusing segmentation cues with varying certainties (see the UAPF subsection). These refined priors condition a quality-aware diffusion decoder (see the UQDD subsection) to generate precise segmentations from Gaussian noise. (b) Inference phase: The system accommodates backbone upgrades through scaling (RN50 to RN101), architecture transitions (CNNs to ViTs), and different pre-training strategies (ImageNet to CLIP), ensuring robust performance on novel classes.
  • Figure 3: The Backbone-Agnostic Feature Transform (BAFT) block converts diverse segmentation cues into unified, backbone-agnostic priors. The Uncertainty-Aware Prior Fusion (UAPF) block then fuses these priors, handling varying annotation quantities and incorporating uncertainty.
  • Figure 4: The pipeline of the proposed Self-Conditioned Modulation (SCM) block.
  • Figure 5: The pipeline of the Dual-level Quality Modulation (DQM) branch, including the error-map-level and IoU-level modulations.
  • ...and 4 more figures
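
As a rough illustration of what fusing priors "with uncertainty" (Figure 3) could look like, the sketch below combines K foreground-probability maps per pixel, down-weighting entries whose binary entropy is high. This is a generic entropy-weighted average written under stated assumptions, not the paper's UAPF block: the weighting rule, the 0.5 fallback for fully uncertain pixels, and the names `binary_entropy` and `fuse_priors` are all hypothetical.

```python
import math

def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli(p) variable, clamped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def fuse_priors(priors):
    """Fuse K same-sized foreground-probability maps pixel by pixel.

    Each prior gets the weight 1 - H(p)/ln 2, so confident predictions
    dominate (hypothetical rule). Pixels where every prior is maximally
    uncertain fall back to 0.5.
    """
    if not priors:
        raise ValueError("need at least one prior map")
    height, width = len(priors[0]), len(priors[0][0])
    fused = [[0.5] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            weights = [
                max(0.0, 1.0 - binary_entropy(p[i][j]) / math.log(2.0))
                for p in priors
            ]
            total = sum(weights)
            if total > 1e-9:
                fused[i][j] = sum(w * p[i][j] for w, p in zip(weights, priors)) / total
    return fused

if __name__ == "__main__":
    one_shot = [[[0.7, 0.2]]]
    three_shot = [[[0.5, 0.9]], [[0.99, 0.9]], [[0.6, 0.8]]]
    print(fuse_priors(one_shot))    # the same function handles one prior map...
    print(fuse_priors(three_shot))  # ...or several
```

Because the weights are normalized per pixel, the same function accepts any number of prior maps, which mirrors the goal of handling different shot counts with one model. A zero-shot setting would need a prior from another source (for example, text), which this sketch does not model.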