gen2seg: Generative Models Enable Generalizable Instance Segmentation

Om Khangaonkar, Hamed Pirsiavash

TL;DR

Gen2seg investigates whether generative pretraining can enable generalizable, category-agnostic instance segmentation. By finetuning Stable Diffusion and MAE on a narrow synthetic domain with an instance-coloring loss, the approach achieves strong zero-shot generalization to unseen object types and image styles, yielding crisper boundaries than several baselines. It approaches or matches SAM on multiple unseen domains and excels at segmenting fine structures and boundaries, supporting the claim that generative priors encode robust perceptual grouping. The work highlights a scalable, data-efficient direction for generalizable perception with potential impact on robotics, medical imaging, and autonomous systems.

Abstract

By pretraining to synthesize coherent images from perturbed inputs, generative models inherently learn to understand object boundaries and scene compositions. How can we repurpose these generative representations for general-purpose perceptual organization? We finetune Stable Diffusion and MAE (encoder+decoder) for category-agnostic instance segmentation using our instance coloring loss exclusively on a narrow set of object types (indoor furnishings and cars). Surprisingly, our models exhibit strong zero-shot generalization, accurately segmenting objects of types and styles unseen in finetuning (and in many cases, MAE's ImageNet-1K pretraining too). Our best-performing models closely approach the heavily supervised SAM when evaluated on unseen object types and styles, and outperform it when segmenting fine structures and ambiguous boundaries. In contrast, existing promptable segmentation architectures or discriminatively pretrained models fail to generalize. This suggests that generative models learn an inherent grouping mechanism that transfers across categories and domains, even without internet-scale pretraining. Code, pretrained models, and demos are available on our website.

Paper Structure

This paper contains 21 sections, 7 equations, 19 figures, 7 tables.

Figures (19)

  • Figure 1: The model that generated the segmentation maps above has never seen masks of humans, animals, or anything remotely similar. We fine-tune generative models for instance segmentation using a synthetic dataset that contains only labeled masks of indoor furnishings and cars. Despite never seeing masks for many object types and image styles present in the visual world, our models are able to generalize effectively. They also learn to accurately segment fine details, occluded objects, and ambiguous boundaries.
  • Figure 2: To showcase the potential of generative models for instance segmentation, we highlight an example from each evaluation dataset where most or all of our models outperform SAM, despite never having seen masks of these object types. SAM often fails on fine structures (wires) or ambiguous boundaries (horses & carriage), leaving black regions where no object was detected. DINO-B also performs poorly, suggesting that generative pretraining (e.g., MAE, Stable Diffusion) learns strong priors for perceptual grouping.
  • Figure 3: Our models assign similar colors to compositionally related parts of a scene. Vader's mask and body (top), or the bowties and shirts (bottom) are separated by subtly different hues, while distinct colors partition unrelated parts such as his leg and the poles (top), or the dogs and text (bottom). This emerges without any part-level supervision, suggesting generative models learn hierarchical scene representations. More samples are provided in Figures \ref{fig:big1} to \ref{fig:big7}.
  • Figure 4: For qualitative comparison, we showcase several results for promptable segmentation using our features. Our finetuned MAE-B and SimpleClick are trained on the same data, using the same backbone, yet our MAE-B strongly outperforms SimpleClick due to its generative prior. Our finetuned Stable Diffusion has never seen a mask of the object type it is segmenting, but performs similarly to SAM, which has been heavily supervised on over a billion masks of all types. Prompt points are shown in green on the input.
  • Figure 5: We evaluate segmentation quality as the number of prompt points increases. Our SD model marginally exceeds SAM at 1 prompt point and recovers more than 82% of SAM's performance at 9 prompt points. This is surprising as we do not use a learned mask encoder for multiple prompt points, but simply merge similarity maps computed individually from each point prompt.
  • ...and 14 more figures
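
The Figure 5 caption describes multi-point prompting as merging similarity maps computed individually from each point prompt, with no learned mask encoder. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the model's output is available as a per-pixel feature array, and it assumes cosine similarity, an element-wise maximum as the merge rule, and a fixed threshold, none of which is specified in the captions above. The function names `similarity_map` and `merge_prompt_masks` are hypothetical.

```python
import numpy as np

def similarity_map(features, point):
    """Cosine similarity between each pixel's feature and the feature at `point`.

    features: (H, W, C) array of per-pixel features (assumed input format).
    point: (row, col) location of one prompt point.
    """
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    unit = features / np.maximum(norms, 1e-8)  # guard against zero vectors
    query = unit[point[0], point[1]]           # feature at the prompt, shape (C,)
    return unit @ query                        # similarity map, shape (H, W)

def merge_prompt_masks(features, points, threshold=0.9):
    """Merge one similarity map per prompt point into a single binary mask."""
    maps = np.stack([similarity_map(features, p) for p in points])  # (P, H, W)
    merged = maps.max(axis=0)  # assumption: merge by element-wise maximum
    return merged >= threshold  # assumption: fixed similarity threshold

# Toy feature map with three regions, each with a distinct feature vector.
features = np.zeros((4, 4, 3))
features[:, :2] = [1.0, 0.0, 0.0]   # left half
features[:2, 2:] = [0.0, 1.0, 0.0]  # top-right block
features[2:, 2:] = [0.0, 0.0, 1.0]  # bottom-right block

# Two prompt points: one in the left half, one in the bottom-right block.
mask = merge_prompt_masks(features, [(0, 0), (3, 3)])
```

In this toy case the merged mask covers the two prompted regions and leaves the unprompted top-right block out. A maximum keeps any pixel that strongly matches at least one prompt; a mean would be a plausible alternative that favours pixels similar to all prompts.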