
Make It Up: Fake Images, Real Gains in Generalized Few-shot Semantic Segmentation

Guohuan Xie, Xin He, Dingying Fan, Le Zhang, Ming-Ming Cheng, Yun Liu

Abstract

Generalized few-shot semantic segmentation (GFSS) is fundamentally limited by the coverage of novel-class appearances under scarce annotations. While diffusion models can synthesize novel-class images at scale, practical gains are often hindered by insufficient coverage and noisy supervision when masks are unavailable or unreliable. We propose Syn4Seg, a generation-enhanced GFSS framework designed to expand novel-class coverage while improving pseudo-label quality. Syn4Seg first maximizes prompt-space coverage by constructing an embedding-deduplicated prompt bank for each novel class, yielding diverse yet class-consistent synthetic images. It then performs support-guided pseudo-label estimation via a two-stage refinement that i) filters low-consistency regions to obtain high-precision seeds and ii) relabels uncertain pixels with image-adaptive prototypes that combine global (support) and local (image) statistics. Finally, we refine only boundary-band and unlabeled pixels using a constrained SAM-based update to improve contour fidelity without overwriting high-confidence interiors. Extensive experiments on PASCAL-$5^i$ and COCO-$20^i$ demonstrate consistent improvements in both 1-shot and 5-shot settings, highlighting synthetic data as a scalable path for GFSS with reliable masks and precise boundaries.
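The abstract mentions an "embedding-deduplicated prompt bank" for each novel class. The paper's exact procedure is not given here; the following is only a minimal sketch of one standard way to do such deduplication, assuming prompt embeddings have already been computed (the function name, greedy strategy, and similarity threshold are illustrative, not the authors' implementation):

```python
import numpy as np

def dedup_prompts(embeddings, prompts, sim_threshold=0.9):
    """Greedy embedding-space deduplication: keep a prompt only if its
    cosine similarity to every already-kept prompt is below the threshold.

    embeddings: (N, C) array of prompt embeddings (e.g. from a text encoder)
    prompts:    list of N prompt strings
    """
    # L2-normalize rows so dot products equal cosine similarities
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept_idx = []
    for i in range(len(prompts)):
        if all(normed[i] @ normed[j] < sim_threshold for j in kept_idx):
            kept_idx.append(i)
    return [prompts[i] for i in kept_idx]
```

A near-duplicate prompt (cosine similarity above the threshold to an already-kept one) is dropped, so the surviving bank stays diverse while remaining class-consistent.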

Paper Structure

This paper contains 14 sections, 8 equations, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: Pipelines of classical GFSS works and our Syn4Seg framework. (a) Classical GFSS methods follow three stages: base class learning, novel class registration, and evaluation, where novel classes rely on limited manual annotations, resulting in low diversity and constrained performance. (b) Our Syn4Seg framework synthesizes diverse, realistic novel class images and pseudo labels guided by support annotations. These synthesized samples, together with base data, are used to train a semantic segmentation model, enabling robust recognition of both base and novel classes during evaluation.
  • Figure 2: (a), (b) Given several novel classes and their support set, we first use HDIG to synthesize high-quality, diverse novel class images proportional to the base class images. Guided by support images, APE refines these synthesized images to generate accurate masks through two stages: Adaptive Pseudo-label Filtering (APF) and Adaptive Pseudo-label Relabeling (APR). Specifically, APF filters noisy regions in the initial mask $M$ to obtain $M_1$; APR relabels these unlabeled regions to yield $M_2$. Finally, SAM-based Boundary Refinement (SABR) further refines the mask boundaries, yielding the final $M_3$. (c), (d) illustrate the details of APF and APR.
  • Figure 3: Qualitative comparison between images generated directly using class-name prompts (a) and those generated with our HDIG method (b).
  • Figure 4: Quantitative ablation results showing the progressive improvements brought by the mask-quality refinement modules: APF, APR, and SABR.
  • Figure 5: Qualitative results of the proposed Syn4Seg and state-of-the-art approaches under the 1-shot setting. The left panel illustrates results on PASCAL-$5^i$, and the right panel on COCO-$20^i$. The first row shows the input images and their corresponding ground-truth masks, followed by the results of VP and BCM in the second and third rows, respectively. The fourth row presents the results of our Syn4Seg.
  • ...and 1 more figure
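The abstract and Figure 2 describe the relabeling step (APR) as assigning uncertain pixels with "image-adaptive prototypes that combine global (support) and local (image) statistics." The paper's actual formulation is not reproduced on this page; the sketch below only illustrates the general idea under simplified assumptions (a single foreground class, Euclidean nearest-prototype assignment, and a hypothetical mixing weight `alpha`):

```python
import numpy as np

def relabel_uncertain(feats, seed_mask, support_proto, alpha=0.5):
    """Relabel uncertain pixels with an image-adaptive foreground prototype:
    a convex mix of the global support prototype and the local prototype
    computed from this image's high-confidence seed pixels.

    feats:        (H, W, C) per-pixel features
    seed_mask:    (H, W) with 1 = foreground seed, 0 = background, -1 = uncertain
    support_proto:(C,) mean foreground feature from the support set
    alpha:        weight on the global (support) prototype; illustrative knob
    """
    local_proto = feats[seed_mask == 1].mean(axis=0)  # local image statistics
    bg_proto = feats[seed_mask == 0].mean(axis=0)     # background statistics
    fg_proto = alpha * support_proto + (1 - alpha) * local_proto
    out = seed_mask.copy()
    for y, x in np.argwhere(seed_mask == -1):
        f = feats[y, x]
        # assign each uncertain pixel to the nearer prototype
        out[y, x] = 1 if np.linalg.norm(f - fg_proto) < np.linalg.norm(f - bg_proto) else 0
    return out
```

Mixing global and local statistics lets the prototype track the appearance of the specific synthetic image while staying anchored to the annotated support examples, which is the stated motivation for the image-adaptive design.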