Unsupervised Segmentation by Diffusing, Walking and Cutting
Daniela Ivanova, Marco Aversa, Paul Henderson, John Williamson
TL;DR
This work tackles unsupervised zero-shot semantic segmentation by taking features from a pre-trained diffusion model, notably the self-attention of Stable Diffusion, and applying Normalised Cuts (NCut) to partition images into semantically coherent regions. The key insight is that self-attention can be treated as a random-walk transition kernel, so NCut can be solved directly on the attention-derived transition matrix or, optionally, on adjacency matrices built from feature similarity. The approach also introduces a dynamic threshold that removes the need to tune the stopping criterion by hand, and a power-walk mechanism that controls semantic granularity. The method achieves state-of-the-art performance among zero-shot approaches on COCO-Stuff-27 and Cityscapes, while remaining model-agnostic and offering scalable, efficient inference. Overall, the paper provides a principled framework that combines diffusion-based representations with spectral clustering, produces hierarchically consistent segmentations without training, and opens avenues for applying diffusion-derived attention to other open-vocabulary vision tasks.
Abstract
We propose an unsupervised image segmentation method using features from pre-trained text-to-image diffusion models. Inspired by classic spectral clustering approaches, we construct adjacency matrices from self-attention layers between image patches and recursively partition the image using Normalised Cuts (NCut). A key insight is that self-attention probability distributions, which capture semantic relations between patches, can be interpreted as a transition matrix for random walks across the image. We leverage this by first applying Random Walk Normalised Cuts directly to these self-attention activations to partition the image, minimising transition probabilities between clusters while maximising coherence within clusters. Applied recursively, this yields a hierarchical segmentation that reflects the rich semantics in the pre-trained attention layers, without any additional training. Next, we explore other ways to build the NCut adjacency matrix from features, and how the random-walk interpretation of self-attention can be used to capture long-range relationships. Finally, we propose an approach to automatically determine the NCut cost criterion, avoiding the need to tune it manually. We quantitatively analyse the effect of incorporating different features, of a constant versus a dynamic NCut threshold, and of incorporating multi-node paths when constructing the NCut adjacency matrix. We show that our approach surpasses all existing methods for zero-shot unsupervised segmentation, achieving state-of-the-art results on COCO-Stuff-27 and Cityscapes.
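To make the random-walk reading of self-attention concrete, the following is a minimal NumPy sketch of a single NCut bipartition on a row-stochastic attention matrix. It is an illustration under stated assumptions, not the authors' implementation: the function names (`softmax_rows`, `ncut_bipartition`), the `walk_steps` parameter, the symmetrisation of the transition matrix, the split at zero, and the toy attention logits are all hypothetical choices made for this example. Raising the matrix to the power `walk_steps` stands in for the multi-step ("multi-node path") walks described above; the paper's automatic cost criterion is not reproduced here.

```python
import numpy as np


def softmax_rows(logits):
    """Row-wise softmax, so each row is a probability distribution."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def ncut_bipartition(P, walk_steps=1):
    """Split patches into two groups from a row-stochastic matrix P.

    Assumptions of this sketch (not taken from the paper):
    - the walk_steps-step transition matrix is symmetrised to obtain
      an undirected affinity matrix W;
    - the split is made at zero on the second eigenvector.
    Returns a boolean mask over patches and the NCut cost of the split.
    """
    # Multi-step random walk: entry (i, j) is the probability of
    # reaching patch j from patch i in exactly walk_steps steps.
    P_k = np.linalg.matrix_power(P, walk_steps)
    # Symmetrise so that the eigenproblem is real and well behaved.
    W = 0.5 * (P_k + P_k.T)
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalised Laplacian: I - D^{-1/2} W D^{-1/2}.
    L_sym = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; column 1 belongs to
    # the second-smallest eigenvalue.
    _, eigvecs = np.linalg.eigh(L_sym)
    # Map back to the generalised eigenvector of (D - W) v = lambda D v.
    v = d_inv_sqrt * eigvecs[:, 1]
    mask = v > 0
    # NCut cost: cut(A, B) / vol(A) + cut(A, B) / vol(B).
    cut = W[mask][:, ~mask].sum()
    cost = cut / d[mask].sum() + cut / d[~mask].sum()
    return mask, cost


# Toy "attention" over 6 patches forming two groups of 3:
# high logits within a group, low logits across groups.
logits = np.zeros((6, 6))
logits[:3, :3] = 4.0
logits[3:, 3:] = 4.0
P = softmax_rows(logits)

mask, cost = ncut_bipartition(P)
mask2, cost2 = ncut_bipartition(P, walk_steps=2)
print(mask, cost)
```

In a recursive scheme, each of the two groups would be split again in the same way until the cost of the best split exceeds some threshold. A fixed threshold is the simplest stand-in; the abstract describes determining this criterion automatically instead.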
