Diffusion Models for Open-Vocabulary Segmentation

Laurynas Karazija, Iro Laina, Andrea Vedaldi, Christian Rupprecht

TL;DR

OVDiff performs open-vocabulary semantic segmentation without collecting data or training: it synthesizes category-specific support sets with diffusion models and grounds them via multiple prototypes. The method unfolds in three stages: generate support images for each category; extract and aggregate foreground, background, and part prototypes; and segment target images by cosine similarity to these prototypes in a shared feature space. It introduces category pre-filtering and stuff-vs-things filtering to reduce spurious matches, and it models background directly through negative prototypes, achieving state-of-the-art results on VOC/Context/Object without supervision. The framework demonstrates how contextual priors embedded in generative models can enable scalable, data-free open-vocabulary segmentation with strong performance on standard benchmarks and in in-the-wild scenarios.
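The prototype-matching stage described above can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`mean_prototype`, `segment`), the 2-D toy features, the single mean prototype per region, and the rule that a pixel whose best match is any background prototype is labelled "background" are all assumptions made for illustration. In practice the features would come from a pre-trained extractor and the masks from a segmenter applied to the synthesized support images.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def mean_prototype(features, mask, foreground=True):
    """Average the support features selected by a binary mask (hypothetical helper)."""
    picked = [f for f, m in zip(features, mask) if bool(m) == foreground]
    if not picked:
        raise ValueError("mask selects no features")
    dim = len(picked[0])
    return [sum(f[d] for f in picked) / len(picked) for d in range(dim)]

def segment(pixel_features, prototypes):
    """Label each pixel with the category of its most similar prototype.

    prototypes maps category -> {"fg": [...], "bg": [...]}.
    Assumption: a best match against any background (negative)
    prototype yields the label "background".
    """
    labels = []
    for f in pixel_features:
        candidates = []
        for category, protos in prototypes.items():
            candidates += [(cosine(f, p), category) for p in protos["fg"]]
            candidates += [(cosine(f, p), "background") for p in protos["bg"]]
        labels.append(max(candidates)[1])
    return labels

# Toy support sets: per-pixel features and foreground masks (made-up values).
support = {
    "cat": ([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]], [1, 1, 0, 0]),
    "dog": ([[-1.0, 0.0], [-0.9, -0.1], [0.0, 1.0], [0.0, 0.9]], [1, 1, 0, 0]),
}
prototypes = {
    name: {
        "fg": [mean_prototype(feats, mask, foreground=True)],
        "bg": [mean_prototype(feats, mask, foreground=False)],
    }
    for name, (feats, mask) in support.items()
}

target_pixels = [[0.8, 0.2], [-0.7, 0.1], [0.1, 1.0]]
labels = segment(target_pixels, prototypes)
print(labels)  # ['cat', 'dog', 'background']
```

Keeping background prototypes per category lets a pixel that resembles typical context for a class be rejected explicitly, rather than relying on a similarity threshold.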

Abstract

Open-vocabulary segmentation is the task of segmenting anything that can be named in an image. Recently, large-scale vision-language modelling has led to significant advances in open-vocabulary segmentation, but at the cost of gargantuan and increasing training and annotation efforts. Hence, we ask if it is possible to use existing foundation models to synthesise on-demand efficient segmentation algorithms for specific class sets, making them applicable in an open-vocabulary setting without the need to collect further data, annotations or perform training. To that end, we present OVDiff, a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation. OVDiff synthesises support image sets for arbitrary textual categories, creating for each a set of prototypes representative of both the category and its surrounding context (background). It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training. Our approach shows strong performance on a range of benchmarks, obtaining a lead of more than 5% over prior work on PASCAL VOC.

Paper Structure (48 sections, 4 equations, 11 figures, 10 tables)

Figures (11)

  • Figure 1: OVDiff is an open-vocabulary segmentation method that, given an image and a free-form set of class names, can segment any user-defined classes. It is fully automatic and does not require any further training.
  • Figure 2: OVDiff overview. Prototype sampling: text queries are used to sample a set of support images, which are further processed by a feature extractor and a segmenter to form positive and negative (background) prototypes. Segmentation: image features are compared against prototypes. The CLIP filter removes irrelevant prototypes based on global image contents.
  • Figure 3: Qualitative results. OVDiff in comparison to TCL (+ PAMR). OVDiff provides more accurate segmentations across a range of object and stuff classes, with well-defined object boundaries that separate cleanly from the background.
  • Figure 4: PASCAL VOC results with increasing support size $N$.
  • Figure 5: Analysis of the segmentation output by linking regions to samples in the support set. Left: our results for different classes. Middle: select color-coded regions "activated" by different prototypes for the class. Right: regions in the support set images corresponding to these (part-level) prototypes.
  • ...and 6 more figures