From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models

Changming Xiao, Qi Yang, Feng Zhou, Changshui Zhang

TL;DR

This work reveals a novel way to extract the rich multi-modal knowledge hidden in diffusion models for segmentation. The approach also generalizes to the learned text embeddings of customized generation methods, requiring only a few modifications.

Abstract

Diffusion models have recently revolutionized the field of text-to-image generation. Their unique way of fusing text and image information contributes to their remarkable capability of generating images that closely match the text. Seen from another perspective, these generative models carry clues about the precise correlation between words and pixels. In this work, we propose a simple but effective method that utilizes the attention mechanism in the denoising network of text-to-image diffusion models. Without re-training or inference-time optimization, the semantic grounding of phrases can be attained directly. We evaluate our method on Pascal VOC 2012 and Microsoft COCO 2014 under the weakly-supervised semantic segmentation setting, where it achieves superior performance to prior methods. In addition, the acquired word-pixel correlation is found to generalize to the learned text embeddings of customized generation methods, requiring only a few modifications. To validate this finding, we introduce a new practical task called "personalized referring image segmentation", together with a new dataset. Experiments in various situations demonstrate the advantages of our method over strong baselines on this task. In summary, our work reveals a novel way to extract the rich multi-modal knowledge hidden in diffusion models for segmentation.

Paper Structure (49 sections, 4 equations, 9 figures, 10 tables)

Figures (9)

  • Figure 1: An overview of our proposed framework. We first add noise to the latent and then feed it into the denoising U-Net together with specially designed text queries. Next, we combine the cross-attention and self-attention in the model to obtain the correlation map between words and pixels. After comparing the different correlation maps and post-processing with dense CRF (dCRF), we obtain the final pseudo masks. Best viewed in color.
  • Figure 2: Visualization of correlation maps. The texts on the left are the corresponding categories. The second column depicts the spectral clustering result obtained from the self-attention map, the third column shows the cross-attention map, the fourth column displays the attention score attained after employing the clustering technique in CBP Mix-and-Match, and the last column shows our final correlation map after propagation. Best viewed in color.
  • Figure 3: Examples from our proposed dataset. The first two columns display multi-view photos of personalized items and the third column presents images of different scenes. The last column shows the highlighted segmentation result along with the text query.
  • Figure 4: Visualizations of the pseudo masks generated by various methods. The first column shows the input image and the last column shows the ground truth mask. Uncertain pixels are set to white.
  • Figure 5: Localization results of different methods on Mug19 dataset samples. The first column shows the object, the last column shows the scene, and the remaining columns display highlighted segmentation masks with the text reference.
  • ...and 4 more figures
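
The captions for Figures 1 and 2 describe combining cross-attention and self-attention into a word-pixel correlation map, then comparing the maps of different categories to obtain pseudo masks. The sketch below is only a toy illustration of that idea, not the authors' implementation. It assumes that propagation can be modeled as a single product of the self-attention matrix with a per-word cross-attention vector, and that comparing maps means a per-pixel argmax with a background threshold. The function names (`propagate`, `min_max`, `compare_maps`), the threshold value, and the toy attention values are all hypothetical, and the dense CRF post-processing is omitted.

```python
def propagate(self_attn, cross_attn):
    """Spread per-pixel word scores along pixel-to-pixel affinities.

    self_attn:  N x N list of lists; row i is pixel i's attention over all pixels.
    cross_attn: length-N list; entry j is pixel j's attention to one word.
    Assumption: one matrix-vector product stands in for the propagation step.
    """
    return [sum(a * c for a, c in zip(row, cross_attn)) for row in self_attn]


def min_max(scores):
    """Rescale scores to [0, 1]; a constant map becomes all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def compare_maps(maps, background_threshold=0.5):
    """Assign each pixel the label with the highest correlation.

    maps: dict mapping a label to its length-N normalized correlation map.
    Pixels whose best score is below the (assumed) threshold become background.
    """
    labels = list(maps)
    n = len(next(iter(maps.values())))
    result = []
    for i in range(n):
        best = max(labels, key=lambda k: maps[k][i])
        result.append(best if maps[best][i] >= background_threshold else "background")
    return result


# Toy example with 4 pixels: pixels 0-1 form one region, pixels 2-3 another.
self_attn = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
]
cat_map = min_max(propagate(self_attn, [0.9, 0.1, 0.0, 0.0]))
dog_map = min_max(propagate(self_attn, [0.0, 0.0, 0.2, 0.8]))
print(compare_maps({"cat": cat_map, "dog": dog_map}))
# prints ['cat', 'cat', 'dog', 'dog']
```

In the toy example, the raw cross-attention for "cat" is concentrated on pixel 0, but the self-attention affinities spread that score to pixel 1, so the whole region receives the label. This mirrors the difference between the cross-attention column and the final propagated column described in Figure 2.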