Unsupervised Segmentation by Diffusing, Walking and Cutting

Daniela Ivanova, Marco Aversa, Paul Henderson, John Williamson

TL;DR

This work tackles unsupervised zero-shot semantic segmentation by leveraging diffusion model features, notably self-attention from Stable Diffusion, and applying Normalised Cuts to partition images into semantically coherent regions. A key insight is treating self-attention as a random-walk transition kernel, enabling NCut to be solved on the diffusion-derived transition matrix and, optionally, on adjacency matrices built from feature similarity; the approach also introduces a hyperparameter-free stopping criterion via dynamic thresholding and a power-walk mechanism to control semantic granularity. The method achieves state-of-the-art performance on COCO-Stuff-27 and Cityscapes among zero-shot approaches, while remaining model-agnostic and offering scalable, efficient inference. Overall, the paper provides a principled framework that combines diffusion-based representations with spectral clustering, producing hierarchically consistent segmentations without training, and opens avenues for applying diffusion-derived attention to other open-vocabulary vision tasks.
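The random-walk reading can be illustrated with a small sketch. It assumes the "power-walk mechanism" amounts to raising a row-stochastic attention matrix to an integer power, so that entry (i, j) of the result is the probability of reaching patch j from patch i in that many steps. The function name `power_walk` and the toy matrix are hypothetical, and the paper's exact construction may differ.

```python
import numpy as np

def power_walk(P, steps):
    """Return the multi-step transition matrix P^steps.

    P is assumed to be row-stochastic (each row sums to 1), as a softmax
    self-attention map would be. This is an illustrative assumption, not
    necessarily the paper's formulation.
    """
    return np.linalg.matrix_power(P, steps)

# Toy 3-patch chain: patch 0 and patch 2 never attend to each other directly.
P = np.array([
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
])

P2 = power_walk(P, 2)
# After two steps, patch 0 can reach patch 2 through patch 1,
# and every row of P2 is still a probability distribution.
```

Larger powers spread probability mass over longer paths, which is one way a single knob could coarsen or refine the resulting segments.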

Abstract

We propose an unsupervised image segmentation method using features from pre-trained text-to-image diffusion models. Inspired by classic spectral clustering approaches, we construct adjacency matrices from self-attention layers between image patches and recursively partition using Normalised Cuts. A key insight is that self-attention probability distributions, which capture semantic relations between patches, can be interpreted as a transition matrix for random walks across the image. We leverage this by first using Random Walk Normalised Cuts directly on these self-attention activations to partition the image, minimizing transition probabilities between clusters while maximizing coherence within clusters. Applied recursively, this yields a hierarchical segmentation that reflects the rich semantics in the pre-trained attention layers, without any additional training. Next, we explore other ways to build the NCuts adjacency matrix from features, and how we can use the random walk interpretation of self-attention to capture long-range relationships. Finally, we propose an approach to automatically determine the NCut cost criterion, avoiding the need to tune this manually. We quantitatively analyse the effect of incorporating different features, of using a constant versus a dynamic NCut threshold, and of incorporating multi-node paths when constructing the NCuts adjacency matrix. We show that our approach surpasses all existing methods for zero-shot unsupervised segmentation, achieving state-of-the-art results on COCO-Stuff-27 and Cityscapes.
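The recursive partitioning described above can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the paper: the attention matrix is symmetrised before cutting, each split thresholds the second eigenvector at zero, and recursion stops at a fixed cost `max_cost` rather than the automatic criterion the abstract mentions. The names `ncut_bipartition` and `recursive_ncut` are hypothetical.

```python
import numpy as np

def ncut_bipartition(W):
    """Split the nodes of a symmetric, strictly positive affinity matrix W.

    Returns a boolean mask for one side of the split and the NCut cost.
    """
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalised Laplacian: I - D^{-1/2} W D^{-1/2}
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)            # eigenvalues in ascending order
    fiedler = d_inv_sqrt * vecs[:, 1]      # second-smallest generalised eigenvector
    mask = fiedler > 0
    if mask.all() or not mask.any():       # degenerate split: report it as unusable
        return mask, np.inf
    cut = W[mask][:, ~mask].sum()
    cost = cut / d[mask].sum() + cut / d[~mask].sum()
    return mask, cost

def recursive_ncut(P, max_cost=0.5, min_size=2):
    """Recursively bipartition patches given a row-stochastic matrix P.

    Assumption: P is symmetrised as (P + P^T) / 2 before cutting.
    max_cost is a fixed, hand-picked stopping threshold for this sketch.
    """
    W = 0.5 * (P + P.T)
    labels = np.zeros(len(W), dtype=int)
    next_label = 1
    stack = [np.arange(len(W))]
    while stack:
        idx = stack.pop()
        if len(idx) < 2 * min_size:
            continue                        # too small to split further
        mask, cost = ncut_bipartition(W[np.ix_(idx, idx)])
        if cost > max_cost:
            continue                        # keep this segment whole
        labels[idx[~mask]] = next_label     # one side gets a fresh label
        next_label += 1
        stack.extend([idx[mask], idx[~mask]])
    return labels
```

Calling `recursive_ncut(P)` on a flattened attention map returns one integer label per patch. Lowering `max_cost` yields fewer, coarser segments, while raising it lets the recursion descend further into the hierarchy.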

Paper Structure

This paper contains 28 sections, 3 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: We perform unsupervised segmentation by applying Normalised Cuts [shi2000normalized] to self-attention features from Stable Diffusion [rombach22sd] in a hyperparameter-free setting, and achieve superior performance on COCO-Stuff-27 [caesar2018coco] compared to trained and zero-shot methods.
  • Figure 2: Qualitative comparison on COCO-Stuff-27 between our simple Random Walk approach for different NCut cost thresholds (columns 2-4) and DiffSeg for different Kullback-Leibler Divergence thresholds (columns 5-7).
  • Figure 3: Self-attention PDF (aggregated) (a) and corresponding dot product (b) and cosine similarity (c) adjacency values for a single patch.
  • Figure 4: Qualitative comparison on COCO-Stuff-27 between NCut over a dot-product Adjacency matrix (columns 2-4) and over a cosine similarity Adjacency matrix (columns 5-7) across a range of NCut thresholds.
  • Figure 5: Qualitative comparison of segmentation results for our two automatic thresholding approaches: Scaled MinCut (columns 2-5) and NCut (columns 6-9) across different self-attention resolution levels.
  • ...and 8 more figures
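
The dot-product and cosine-similarity adjacency matrices named in the captions for Figures 3 and 4 can be illustrated with a short sketch. It assumes each patch is described by its row of the aggregated self-attention matrix, and that adjacency is computed between those rows. This is a reading of the captions and not a confirmed detail of the method, and the function names are hypothetical.

```python
import numpy as np

def dot_product_adjacency(P):
    """Adjacency from raw inner products between per-patch attention rows."""
    return P @ P.T

def cosine_adjacency(P):
    """Adjacency from inner products between L2-normalised attention rows."""
    unit = P / np.linalg.norm(P, axis=1, keepdims=True)
    return unit @ unit.T

# Toy attention map: patches 0 and 1 attend similarly, patch 2 differs.
P = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
])

A_dot = dot_product_adjacency(P)
A_cos = cosine_adjacency(P)
```

Both matrices are symmetric, so they can be passed to a standard NCut solver. The cosine variant removes the dependence on how peaked each attention row is, which is why its diagonal is constant while the dot-product diagonal varies from patch to patch.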