No Re-Train, More Gain: Upgrading Backbones with Diffusion Model for Pixel-Wise and Weakly-Supervised Few-Shot Segmentation
Shuai Chen, Fanman Meng, Chenhao Wu, Haoran Wei, Runtong Zhang, Qingbo Wu, Linfeng Xu, Hongliang Li
TL;DR
DiffUp tackles practical gaps in few-shot segmentation (FSS) by reframing the task as a diffusion process conditioned on backbone-agnostic priors derived from diverse annotations. It introduces BAFT (backbone-agnostic feature transformation) to harmonize features across backbones, UAPF to fuse information from varying shot counts in an uncertainty-aware manner, and UQDD to guide diffusion over multi-granularity priors via SCM (self-conditioned modulation) and DQM (dual-level quality modulation). The approach enables backbone upgrades without re-training, supports multiple annotation forms, and handles zero-shot to many-shot scenarios within a single model. Empirically, DiffUp achieves state-of-the-art results on Pascal-5i and COCO-20i, demonstrates strong cross-dataset transfer, and remains competitive under reduced annotation and training requirements, offering a practical solution for flexible FSS deployment.
Abstract
Few-Shot Segmentation (FSS) aims to segment novel classes using only a few annotated images. Despite considerable progress under pixel-wise support annotation, current FSS methods still face three issues: the inability to upgrade the backbone without re-training, the inability to handle various types of annotations (e.g., scribble, bounding box, mask, and text) in a uniform way, and the difficulty of accommodating different annotation quantities. To address these issues simultaneously, we propose DiffUp, a novel framework that formulates the FSS task as a conditional generative problem using a diffusion process. For the first issue, we introduce a backbone-agnostic feature transformation module that converts different segmentation cues into unified coarse priors, facilitating seamless backbone upgrades without re-training. For the second issue, because the transformed priors from diverse annotation types (scribble, bounding box, mask, and text) vary in granularity, we treat these multi-granular priors as analogous to noisy intermediates at different steps of a diffusion model. This is implemented via a self-conditioned modulation block coupled with a dual-level quality modulation branch. For the third issue, we incorporate an uncertainty-aware information fusion module to harmonize the variability across zero-shot, one-shot, and many-shot scenarios. Evaluated on rigorous benchmarks, DiffUp significantly outperforms existing FSS models in terms of flexibility and accuracy.
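To make the analogy between coarse priors and noisy diffusion intermediates concrete, the following is a minimal sketch of the standard DDPM forward process applied to a binary mask. It is an illustration under stated assumptions, not the DiffUp implementation: the linear noise schedule, the function names, and in particular the table that assigns each annotation type an entry step are hypothetical and do not come from the paper.

```python
import math
import random


def linear_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for the linear beta schedule used in DDPM."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_mask(mask, t, alpha_bars, rng):
    """Sample x_t ~ q(x_t | x_0) for a binary mask rescaled to {-1, +1}.

    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, 1).
    """
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    return [
        [signal * (2.0 * v - 1.0) + sigma * rng.gauss(0.0, 1.0) for v in row]
        for row in mask
    ]


# HYPOTHETICAL mapping (an assumption for illustration only): a coarser
# annotation is treated as an intermediate at a later, noisier step.
ASSUMED_ENTRY_STEP = {"mask": 100, "scribble": 400, "bounding_box": 600, "text": 900}

ALPHA_BARS = linear_alpha_bars()
clean = [[0, 1, 1, 0], [0, 1, 1, 0]]  # toy 2x4 ground-truth mask
noisy = noise_mask(
    clean, ASSUMED_ENTRY_STEP["bounding_box"], ALPHA_BARS, random.Random(0)
)
```

In this reading, a fine prior such as a mask would enter the reverse process at a step where most of the signal is preserved, while a coarse prior such as text would enter where little signal remains and more denoising steps are needed. How DiffUp actually estimates and uses prior quality is handled by its modulation modules and is not reproduced here.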
