Repurposing Stable Diffusion Attention for Training-Free Unsupervised Interactive Segmentation

Markus Karmann, Onay Urfalioglu

TL;DR

This work introduces M2N2, a training-free, unsupervised interactive segmentation framework that reinterprets Stable Diffusion self-attention as a Markov transition operator to build one Markov-map per prompt point. By combining an iterative Markov process, a temperature-controlled, doubly stochastic attention tensor, and a flood-fill refinement, M2N2 enables instance-aware segmentation via a truncated nearest neighbor across multiple user prompts. The approach achieves state-of-the-art NoC on several datasets without any segmentation labels and demonstrates robustness across backbones, especially SD2 with its high-resolution attention maps. The authors note limitations with thin structures and with overlapping or obstructed instances, pointing to future work on capturing fine detail and handling complex scenes. Overall, M2N2 offers a fast, training-free alternative for interactive segmentation, with practical value for rapid labeling and GUI-based annotation workflows.

Abstract

Recent progress in interactive point-prompt-based image segmentation significantly reduces the manual effort needed to obtain high-quality semantic labels. State-of-the-art unsupervised methods use self-supervised pre-trained models to obtain pseudo-labels, which are then used to train a prompt-based segmentation model. In this paper, we propose a novel unsupervised and training-free approach based solely on the self-attention of Stable Diffusion. We interpret the self-attention tensor as a Markov transition operator, which enables us to iteratively construct a Markov chain. Pixel-wise counting of the number of iterations along the Markov chain required to reach a relative probability threshold yields a Markov-iteration-map, which we simply call a Markov-map. We show that, compared to the raw attention maps, our proposed Markov-map has less noise, sharper semantic boundaries, and more uniform values within semantically similar regions. We integrate the Markov-map into a simple yet effective truncated nearest neighbor framework to obtain interactive point-prompt-based segmentation. Despite being training-free, our approach experimentally yields excellent results in terms of Number of Clicks (NoC), even outperforming state-of-the-art training-based unsupervised methods on most of the datasets. Code is available at https://github.com/mkarmann/m2n2.
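The iteration counting described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that); it assumes a row-stochastic matrix `A` standing in for the aggregated self-attention, and it assumes that "relative probability threshold" means a threshold relative to the current maximum of the distribution. The names `markov_map`, `tau`, and `max_iters`, as well as the toy matrix, are hypothetical.

```python
import numpy as np

def markov_map(A, start, tau=0.3, max_iters=100):
    """Count, per state, the iterations until its relative probability reaches tau.

    A         : (n, n) row-stochastic matrix (assumed stand-in for self-attention)
    start     : index of the prompt state
    tau       : relative threshold, compared against p / p.max() (an assumption)
    max_iters : value kept for states that never reach the threshold
    """
    n = A.shape[0]
    p = np.zeros(n)
    p[start] = 1.0                      # one-hot distribution at the prompt
    M = np.full(n, max_iters, dtype=int)
    for t in range(max_iters):
        # States that reach the threshold for the first time at iteration t.
        reached = (p / p.max() >= tau) & (M == max_iters)
        M[reached] = t
        p = p @ A                       # one step along the Markov chain
    return M

# Toy chain: states {0, 1} and {2, 3} form two tightly coupled clusters
# that are only weakly linked to each other.
A = np.array([
    [0.50, 0.49, 0.01, 0.00],
    [0.49, 0.50, 0.01, 0.00],
    [0.01, 0.00, 0.50, 0.49],
    [0.00, 0.01, 0.49, 0.50],
])
M = markov_map(A, start=0)
print(M)  # state 1 is reached after one step, states 2 and 3 only much later
```

In this toy example the map is nearly constant inside the prompt's own cluster and jumps at the cluster boundary, which is the qualitative behavior the abstract attributes to Markov-maps.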

Paper Structure

This paper contains 23 sections, 12 equations, 12 figures, 3 tables, 1 algorithm.

Figures (12)

  • Figure 1: M2N2 framework overview. We perform a single denoising step of the input image with Stable Diffusion 2 to obtain attention tensors. The tensors are aggregated and used to obtain a Markov-map $M_i$ for each prompt point. The final segmentation is the result of a truncated nearest neighbor over the scaled Markov-maps $M_i$, which serve as a measure of semantic distance for each prompt point. The green and red areas in the scaled Markov-maps denote regions where the distance is less than or equal to the global background threshold. In this visualization, components in blue contain adjustable hyperparameters.
  • Figure 2: Comparison of semantic maps. Each map is generated from a single prompt point. For better comparison, Markov-maps are inverted such that the lowest value is white and the highest value is black.
  • Figure 3: Impact of the hyperparameters of SD and of the Markov-map on all four datasets, each represented by a single color. Dashed lines correspond to NoC85, solid lines to NoC90. The graph for SBD is based on a randomly sampled subset of 500 images.
  • Figure 4: Segmentation examples on DAVIS. Each column shows examples selected by their NoC90 value, ranging from easy cases $\mathrm{NoC90}=1$ on the left to difficult cases $\mathrm{NoC90}=10$ and failure cases $\mathrm{NoC90}=20$ on the right. Foreground points are shown in green and background points in red. The bottom right example is especially difficult for M2N2 because it contains only small, thin, isolated structures.
  • Figure 5: Generation process of a Markov-map. Each column shows the current state of the probability distribution $p_t$ and the corresponding Markov-map $M$ for a given number of iterations $t$. The first row contains the input image and prompt point. The second and third rows show the probability distributions $p_t$ of the original attention tensor without IPF and of the doubly stochastic attention tensor resulting from applying IPF. The last two rows are the Markov-maps $M$ with and without flood fill. The Markov-maps are shown with the maximum number of iterations set to $t$. Each map is scaled up to the image resolution with nearest-neighbor interpolation instead of JBU for better comparison.
  • ...and 7 more figures
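
The truncated nearest neighbor step named in the Figure 1 caption can be sketched as follows. This is one plausible reading of the caption, not the authors' code: each pixel takes the label of the prompt whose scaled Markov-map gives the smallest distance, and pixels whose smallest distance exceeds the global background threshold fall back to background. The function name, argument names, and toy values are hypothetical.

```python
import numpy as np

def truncated_nearest_neighbor(maps, is_foreground, threshold):
    """Assign each pixel to its nearest prompt, truncated by a global threshold.

    maps          : (k, h, w) array, one scaled distance map per prompt point
    is_foreground : (k,) boolean array, True for foreground prompts
    threshold     : global background threshold on the distance (assumed scalar)
    Returns an (h, w) boolean foreground mask.
    """
    maps = np.asarray(maps, dtype=float)
    is_foreground = np.asarray(is_foreground, dtype=bool)
    nearest = maps.argmin(axis=0)        # index of the closest prompt per pixel
    nearest_dist = maps.min(axis=0)      # distance to that prompt
    # Foreground only if the closest prompt is a foreground prompt
    # and it lies within the truncation threshold.
    return is_foreground[nearest] & (nearest_dist <= threshold)

# Toy example: one foreground and one background prompt on a 1x4 image.
maps = np.array([
    [[0.0, 2.0, 9.0, 8.0]],   # distances to the foreground prompt
    [[5.0, 3.0, 1.0, 9.0]],   # distances to the background prompt
])
mask = truncated_nearest_neighbor(maps, [True, False], threshold=4.0)
print(mask)  # [[ True  True False False]]
```

The last pixel is closest to the foreground prompt but beyond the threshold, so the truncation sends it to background; without truncation, a single foreground click would claim every pixel not explicitly covered by a background click.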