iSeg: An Iterative Refinement-based Framework for Training-free Segmentation

Lin Sun, Jiale Cao, Jin Xie, Fahad Shahbaz Khan, Yanwei Pang

TL;DR

This work tackles training-free semantic segmentation by exploiting pre-trained stable diffusion attention maps. It introduces entropy-reduced self-attention (Ent-Self) to suppress irrelevant global information and a category-enhanced cross-attention (Cat-Cross) to provide a better initialization, enabling a stable iterative refinement of cross-attention maps. The proposed iSeg framework achieves state-of-the-art results across weakly supervised, open-vocabulary, unsupervised, and synthetic-mask tasks, with notable gains such as an absolute $3.8\%$ mIoU improvement on Cityscapes in unsupervised settings. Significantly, this approach demonstrates robust cross-domain capability and interaction flexibility without any segmentation-domain training, underscoring the practical potential of diffusion-based, training-free segmentation for diverse real-world applications.

Abstract

Stable diffusion has demonstrated a strong ability to synthesize images from text descriptions, suggesting that it contains strong semantic cues for grouping objects. Researchers have therefore explored employing stable diffusion for training-free segmentation. Most existing approaches refine the cross-attention map with the self-attention map only once, demonstrating that the self-attention map contains useful semantic information for improving segmentation. To fully utilize the self-attention map, we present an in-depth experimental analysis of iteratively refining the cross-attention map with the self-attention map, and propose an effective iterative refinement framework for training-free segmentation, named iSeg. iSeg introduces an entropy-reduced self-attention module that uses a gradient descent scheme to reduce the entropy of the self-attention map, thereby suppressing the weak responses corresponding to irrelevant global information. Leveraging this module, iSeg stably improves the refined cross-attention map over successive iterations. Further, we design a category-enhanced cross-attention module to generate a more accurate cross-attention map, providing a better initial input for iterative refinement. Extensive experiments across different datasets and diverse segmentation tasks demonstrate the merits of the proposed contributions. For unsupervised semantic segmentation on Cityscapes, iSeg achieves an absolute gain of 3.8% in mIoU over the best existing training-free approach in the literature. Moreover, iSeg supports segmentation with different kinds of images and interactions. The project is available at https://linsun449.github.io/iSeg.
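To make the idea of refining a cross-attention map with a self-attention map concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that one refinement step is a matrix-vector product of a row-normalized self-attention matrix with the flattened cross-attention scores of a single category, followed by min-max rescaling. The function names (`normalize_rows`, `refine`), the rescaling step, and the toy 4-location values are all illustrative assumptions, and plain Python lists stand in for real attention tensors.

```python
def normalize_rows(matrix):
    """Scale each row so that it sums to 1 (row-stochastic matrix)."""
    return [[v / sum(row) for v in row] for row in matrix]


def refine(cross_attn, self_attn, num_iters):
    """Repeatedly propagate cross-attention scores through self-attention.

    cross_attn: list of N scores for one category, one per spatial location.
    self_attn:  N x N row-stochastic matrix; self_attn[i][j] is how strongly
                location i attends to location j.
    """
    refined = list(cross_attn)
    for _ in range(num_iters):
        # Each location takes the attention-weighted average of all scores.
        refined = [sum(w * r for w, r in zip(row, refined)) for row in self_attn]
        # Min-max rescale to [0, 1] (an assumption, to keep iterations comparable).
        lo, hi = min(refined), max(refined)
        refined = [(r - lo) / (hi - lo) if hi > lo else 0.0 for r in refined]
    return refined


# Toy example: locations 0-1 belong to an object, locations 2-3 to background.
# Each location also attends weakly to the other group (global leakage).
toy_self_attn = normalize_rows([
    [4, 4, 1, 1],
    [4, 4, 1, 1],
    [1, 1, 4, 4],
    [1, 1, 4, 4],
])
toy_cross_attn = [0.9, 0.4, 0.3, 0.1]  # weak initial response at location 1
print(refine(toy_cross_attn, toy_self_attn, num_iters=3))
```

In this toy setting the weakly activated object location is pulled up to match its neighbor. With a real self-attention map, the weak responses to unrelated locations accumulate over iterations, which is the noise that the entropy-reduced self-attention module is designed to suppress.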

Paper Structure (15 sections, 9 equations, 12 figures, 6 tables, 1 algorithm)

Figures (12)

  • Figure 1: Comparison of the naive iteration strategy and our entropy-reduced self-attention (Ent-Self) module. We feed the text prompt (a) and image (b) into pre-trained stable diffusion to extract cross-attention (c) and self-attention maps. We select only the cross-attention maps corresponding to category names, and omit the self-attention map here for simplicity. We then refine the selected cross-attention maps with the self-attention map using the naive iteration strategy and our Ent-Self. The naive iteration strategy refines the cross-attention maps directly with the original self-attention map, which leads to noisy cross-attention maps after multiple iterations, such as the feature maps of categories 'people' and 'bird' in (d). In contrast, our Ent-Self generates accurate refined cross-attention maps for 'people' and 'bird' in (e). In (f), we give a quantitative comparison of pseudo-mask generation on PASCAL VOC for weakly-supervised semantic segmentation. Unlike naive iteration, our Ent-Self improves mask generation as the number of iterations increases.
  • Figure 2: Cross-attention map refinement with the predicted self-attention map and the ground-truth self-attention map. Given the image (a) and cross-attention map (b), we first show the refined cross-attention maps at different iterations using the predicted self-attention map in (c). We also present the refined cross-attention maps using the ground-truth (GT) self-attention map in (d), where the GT self-attention map is generated from ground-truth masks. Finally, we compare the predicted and GT self-attention maps at a selected point in (e): the predicted self-attention map is noisy, whereas the GT self-attention map is clean.
  • Figure 3: Architecture of our proposed training-free iSeg. iSeg comprises two novel modules: the Cat-Cross module generates a more accurate cross-attention map, while the Ent-Self module reduces irrelevant information in the self-attention map. Given an image with a paired text prompt, we first obtain the latent feature $z$ and the embedding feature $\varepsilon$ from the VAE encoder and the Cat-Cross module, respectively. These features are fed into the denoising U-Net to extract the cross-attention map $A_{\mathrm{ca}}$ and the self-attention map $A_{\mathrm{sa}}$. The Ent-Self module is then applied to reduce the entropy of $A_{\mathrm{sa}}$ and obtain $A_{\mathrm{sa}}^{\mathrm{ent}}$. Finally, iterative refinement is performed to refine the cross-attention map with the entropy-reduced self-attention map.
  • Figure 4: Refined self-attention map at selected points. Our iterative refinement with entropy-reduced self-attention can be viewed as generating a better refined self-attention map. After multiple iterations, the refined self-attention map becomes more similar to the ground-truth self-attention map at the corresponding red point, and can therefore better refine the cross-attention map.
  • Figure 5: Comparison of cross-attention maps before and after the Cat-Cross module. Compared to the original cross-attention map (b), the refined cross-attention map (c) is cleaner and responds strongly around the corresponding objects in the red bounding box.
  • ...and 7 more figures
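The Figure 3 caption describes an Ent-Self step that reduces the entropy of $A_{\mathrm{sa}}$ to obtain $A_{\mathrm{sa}}^{\mathrm{ent}}$. The exact update rule is not given in this summary, so the following is only a hedged sketch of one plausible reading: each row of the self-attention matrix is treated as a probability distribution $p$, a gradient descent step is taken on its Shannon entropy $H(p) = -\sum_j p_j \log p_j$ (whose partial derivative is $-(\log p_j + 1)$), and the row is then clipped and renormalized. The function names, the step size, the clipping constant, and the renormalization are all assumptions made for illustration.

```python
import math


def row_entropy(row):
    """Shannon entropy (in nats) of one attention row."""
    return -sum(p * math.log(p) for p in row if p > 0.0)


def reduce_entropy(self_attn, step_size=0.05, num_steps=1, eps=1e-8):
    """Sharpen each row of a row-stochastic matrix by descending its entropy.

    Since dH/dp_j = -(log p_j + 1), a descent step adds
    step_size * (log p_j + 1) to every entry. Entries are then clipped to
    stay positive and the row is renormalized (both are assumptions here).
    """
    sharpened = []
    for row in self_attn:
        p = list(row)
        for _ in range(num_steps):
            p = [max(v + step_size * (math.log(max(v, eps)) + 1.0), eps) for v in p]
            total = sum(p)
            p = [v / total for v in p]
        sharpened.append(p)
    return sharpened


# Toy row: two strong responses and two weak, leakage-like responses.
toy_row = [0.4, 0.4, 0.1, 0.1]
toy_sharp = reduce_entropy([toy_row])[0]
print(row_entropy(toy_row), row_entropy(toy_sharp))
```

On this toy row the weak entries shrink and the strong entries grow after renormalization, so the entropy drops. This is the qualitative behavior the caption attributes to Ent-Self: weak responses to irrelevant locations are suppressed before the map is used for iterative refinement.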