Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion

Junjiao Tian, Lavisha Aggarwal, Andrea Colaco, Zsolt Kira, Mar Gonzalez-Franco

TL;DR

DiffSeg introduces unsupervised zero-shot segmentation by exploiting self-attention maps from a pre-trained Stable Diffusion model. It aggregates multi-resolution attention and iteratively merges maps using KL-divergence to form coherent segmentation masks without any training or prompts, achieving state-of-the-art results on COCO-Stuff-27 and strong zero-shot generalization on Cityscapes. An optional Semantic DiffSeg extension adds open-vocabulary labels by grounding cross-attention to nouns extracted from a caption. Overall, the approach demonstrates that object grouping and segmentation cues are emergent in diffusion-model attention and can be harnessed without supervision for open-world segmentation.

Abstract

Producing quality segmentation masks for images is a fundamental problem in computer vision. Recent research has explored large-scale supervised training to enable zero-shot segmentation on virtually any image style and unsupervised training to enable segmentation without dense annotations. However, constructing a model capable of segmenting anything in a zero-shot manner without any annotations is still challenging. In this paper, we propose to utilize the self-attention layers in stable diffusion models to achieve this goal because the pre-trained stable diffusion model has learned inherent concepts of objects within its attention layers. Specifically, we introduce a simple yet effective iterative merging process based on measuring KL divergence among attention maps to merge them into valid segmentation masks. The proposed method does not require any training or language dependency to extract quality segmentation for any images. On COCO-Stuff-27, our method surpasses the prior unsupervised zero-shot SOTA method by an absolute 26% in pixel accuracy and 17% in mean IoU. The project page is at https://sites.google.com/view/diffseg/home.
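
To make the merging idea in the abstract concrete, here is a minimal sketch of grouping attention maps by KL divergence. It is not the authors' implementation: the symmetrised distance, the threshold `tau`, the greedy grouping order, and all function names are assumptions made for illustration, and each attention map is represented as a small flattened list rather than a tensor taken from Stable Diffusion.

```python
import math

EPS = 1e-12  # guards against log(0); an illustrative choice


def normalize(attn):
    """Rescale a flattened attention map so that it sums to 1."""
    total = sum(attn)
    return [a / total for a in attn]


def kl(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + EPS) / (qi + EPS)) for pi, qi in zip(p, q))


def sym_kl(p, q):
    """Symmetrised KL divergence (the symmetric form is an assumption)."""
    return kl(p, q) + kl(q, p)


def merge_once(proposals, tau):
    """One greedy pass: group maps closer than tau and average each group."""
    merged = []
    used = [False] * len(proposals)
    for i, p in enumerate(proposals):
        if used[i]:
            continue
        group = [p]
        used[i] = True
        for j in range(i + 1, len(proposals)):
            if not used[j] and sym_kl(p, proposals[j]) < tau:
                group.append(proposals[j])
                used[j] = True
        # The average of normalized maps is itself a normalized map.
        merged.append([sum(vals) / len(group) for vals in zip(*group)])
    return merged


def iterative_merge(maps, tau, n_iters):
    """Repeat the merging pass n_iters times, starting from normalized maps."""
    proposals = [normalize(m) for m in maps]
    for _ in range(n_iters):
        proposals = merge_once(proposals, tau)
    return proposals


def to_mask(proposals):
    """Label each pixel with the index of the proposal that is largest there."""
    n_pixels = len(proposals[0])
    return [max(range(len(proposals)), key=lambda k: proposals[k][i])
            for i in range(n_pixels)]


# Toy input: four attention maps over a 4-pixel image, two per "object".
toy_maps = [
    [8, 7, 1, 1],
    [7, 8, 1, 1],
    [1, 1, 8, 7],
    [1, 1, 7, 8],
]
toy_proposals = iterative_merge(toy_maps, tau=0.5, n_iters=3)
print(len(toy_proposals))      # 2
print(to_mask(toy_proposals))  # [0, 0, 1, 1]
```

In this toy case the two maps that attend to the left pixels collapse into one proposal and the two that attend to the right pixels collapse into another, so the final per-pixel argmax yields two segments. The threshold controls how aggressively maps are merged: a larger `tau` gives fewer, coarser segments.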

Paper Structure

This paper contains 22 sections, 11 equations, 18 figures, 4 tables, 1 algorithm.

Figures (18)

  • Figure 1: Overview of DiffSeg. DiffSeg is an unsupervised, zero-shot segmentation algorithm that uses a pre-trained stable diffusion model. Starting from $M\times M$ anchor points, DiffSeg iteratively merges self-attention maps from the diffusion model for $N$ iterations to segment any image without prior knowledge or external information.
  • Figure 2: Visualization of Segmentation Masks and Self-Attention Tensors. Left: Overlay of the segmentation on the original image. Right: Attention maps from a stable diffusion model have two properties: Intra-Attention Similarity and Inter-Attention Similarity. Maps of different resolutions have varying receptive fields w.r.t. the original image.
  • Figure 3:
  • Figure 4: Effects of using Different Aggregation Weights $(R)$. DiffSeg uses a proportional aggregation strategy to balance consistency and detailedness. Higher-resolution maps produce more detailed but fractured segmentation while lower-resolution maps produce more consistent but coarse segmentation.
  • Figure 5: Examples of Segmentation on DomainNet Sketch. Overlay (left), input (middle), and segmentation (right).
  • ...and 13 more figures
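
The captions of Figures 1 and 4 mention two ingredients that a short sketch can illustrate: combining attention maps of different resolutions with proportional weights, and starting from an $M\times M$ grid of anchor points. The code below is a simplified illustration, not the paper's procedure. It assumes that "proportional" means weights proportional to each map's resolution, that maps are upsampled with nearest-neighbour interpolation, and that anchors sit at the centres of evenly spaced cells. It also treats each attention map as a single square 2D list, and the helper names are hypothetical.

```python
def upsample_nearest(attn, size):
    """Nearest-neighbour upsampling of a square 2D map to size x size."""
    src = len(attn)
    return [[attn[r * src // size][c * src // size] for c in range(size)]
            for r in range(size)]


def aggregate(maps):
    """Weighted sum of maps at the largest resolution.

    Each map is upsampled, normalized to sum to 1, and weighted by
    resolution / sum of resolutions (an assumed reading of "proportional").
    """
    target = max(len(m) for m in maps)
    total_res = sum(len(m) for m in maps)
    out = [[0.0] * target for _ in range(target)]
    for m in maps:
        weight = len(m) / total_res
        up = upsample_nearest(m, target)
        mass = sum(sum(row) for row in up)
        for r in range(target):
            for c in range(target):
                out[r][c] += weight * up[r][c] / mass
    return out


def anchor_grid(size, m):
    """Return m x m evenly spaced (row, col) anchors on a size x size grid."""
    step = size / m
    coords = [int(step * (k + 0.5)) for k in range(m)]
    return [(r, c) for r in coords for c in coords]


# Toy input: a coarse 2x2 map and a finer 4x4 map of the same region.
coarse = [[1, 1],
          [0, 0]]
fine = [[1, 1, 1, 1],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
combined = aggregate([coarse, fine])
anchors = anchor_grid(64, 2)
print(anchors)  # [(16, 16), (16, 48), (48, 16), (48, 48)]
```

With these weights the 4x4 map contributes twice as much as the 2x2 map, so the combined map keeps the fine map's sharp first row while the coarse map still adds a fainter response in the second row. Shifting weight toward the coarse map gives smoother but blurrier output, which matches the trade-off described in the Figure 4 caption.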