
DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models

Weijia Wu, Yuzhong Zhao, Mike Zheng Shou, Hong Zhou, Chunhua Shen

TL;DR

This work tackles the high cost of pixel-level annotations for semantic segmentation by extracting class-specific, high-resolution masks from cross-attention maps in a pre-trained diffusion model (Stable Diffusion). It introduces DiffuMask, a pipeline that converts attention signals into masks with adaptive binarization, noise learning, prompt engineering, and data augmentation to bridge the synthetic-real domain gap. The approach yields competitive results on VOC 2012, Cityscapes, and ADE20K, with strong zero-shot performance on unseen VOC classes, showcasing the potential of text-driven synthetic data to replace or augment real annotations. The findings suggest a practical path toward scalable segmentation datasets generated entirely from textual prompts.

Abstract

Collecting and annotating images with pixel-wise labels is time-consuming and laborious. In contrast, synthetic data can be obtained freely from a generative model (e.g., DALL-E, Stable Diffusion). In this paper, we show that it is possible to automatically obtain accurate semantic masks of synthetic images generated by the off-the-shelf Stable Diffusion model, which uses only text-image pairs during training. Our approach, called DiffuMask, exploits the potential of the cross-attention map between text and image, which makes it natural and seamless to extend text-driven image synthesis to semantic mask generation. DiffuMask uses text-guided cross-attention information to localize class/word-specific regions, which are combined with practical techniques to create a novel high-resolution and class-discriminative pixel-wise mask. The method significantly reduces data collection and annotation costs. Experiments demonstrate that existing segmentation methods trained on DiffuMask's synthetic data can achieve performance competitive with their counterparts trained on real data (VOC 2012, Cityscapes). For some classes (e.g., bird), DiffuMask presents promising performance, close to the state-of-the-art result of real data (within 3% mIoU gap). Moreover, in the open-vocabulary segmentation (zero-shot) setting, DiffuMask achieves a new SOTA result on the unseen classes of VOC 2012. The project website can be found at https://weijiawu.github.io/DiffusionMask/.
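
The step from cross-attention maps to a binary mask can be illustrated with a small sketch. It is not the authors' implementation: the excerpt above does not give the paper's equations, so the fusion rule (nearest-neighbour upsampling followed by a plain average), the min-max normalization, and the helper names `upsample_nearest`, `average_maps`, `normalize`, and `binarize` are all assumptions made for illustration. The toy attention values are invented as well.

```python
# Illustrative sketch only: fuse cross-attention maps of different
# resolutions and binarize the result with a threshold gamma.
# The fusion rule and all helper names are assumptions, not the paper's code.

def upsample_nearest(attn, size):
    """Nearest-neighbour upsample a square 2D list to size x size."""
    n = len(attn)
    return [[attn[r * n // size][c * n // size] for c in range(size)]
            for r in range(size)]

def average_maps(maps, size):
    """Upsample every map to a common size and average them pixel-wise."""
    up = [upsample_nearest(m, size) for m in maps]
    return [[sum(u[r][c] for u in up) / len(up) for c in range(size)]
            for r in range(size)]

def normalize(attn):
    """Min-max normalize a 2D map to the range [0, 1]."""
    flat = [v for row in attn for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in attn]
    return [[(v - lo) / span for v in row] for row in attn]

def binarize(attn, gamma):
    """Mark a pixel as foreground when its attention is at least gamma."""
    return [[1 if v >= gamma else 0 for v in row] for row in attn]

# Toy attention maps for one class token at two resolutions (made-up values).
low = [[0.1, 0.9],
       [0.2, 0.8]]
high = [[0.0, 0.0, 1.0, 1.0],
        [0.0, 0.0, 1.0, 1.0],
        [0.0, 0.2, 0.8, 1.0],
        [0.0, 0.0, 1.0, 1.0]]

fused = normalize(average_maps([low, high], size=4))
mask = binarize(fused, gamma=0.5)
for row in mask:
    print(row)
```

A fixed `gamma` is used here only to keep the sketch short; the TL;DR describes the binarization in DiffuMask as adaptive.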

Paper Structure (17 sections, 4 equations, 11 figures, 8 tables)

This paper contains 17 sections, 4 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: DiffuMask synthesizes photo-realistic images and high-quality mask annotations by exploiting the attention maps of the diffusion model. Without human effort for localization, DiffuMask is capable of producing high-quality semantic masks.
  • Figure 2: Cross attention maps of different text tokens.
  • Figure 3: Cross attention maps of different resolutions.
  • Figure 4: Binarized masks with different thresholds $\gamma$ in the binarization equation.
  • Figure 6: Relationship between mask quality (IoU) and threshold for various categories. $1k$ generated images from Stable Diffusion (Rombach et al., 2022) are used for each class. Mask2Former (Cheng et al., 2022) pre-trained on Pascal VOC 2012 (Everingham et al., 2010) is used to generate the ground truth. The optimal threshold usually differs across classes.
  • ...and 6 more figures
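
The threshold sweep described in the Figure 6 caption can be sketched as follows. This is a hedged illustration, not the authors' procedure: it assumes normalized attention maps and reference masks are already available as nested lists, and the helper names `iou` and `best_threshold`, the candidate thresholds, and the toy values are hypothetical. In the paper's setup the reference masks come from a pre-trained Mask2Former; here they are written by hand.

```python
# Illustrative sketch only: pick, per class, the threshold gamma that
# maximizes mean IoU against reference masks. Names and data are hypothetical.

def binarize(attn, gamma):
    """Mark a pixel as foreground when its attention is at least gamma."""
    return [[1 if v >= gamma else 0 for v in row] for row in attn]

def iou(pred, ref):
    """Intersection over union of two binary masks of equal shape."""
    pairs = [(p, r) for prow, rrow in zip(pred, ref)
             for p, r in zip(prow, rrow)]
    inter = sum(p & r for p, r in pairs)
    union = sum(p | r for p, r in pairs)
    return inter / union if union else 1.0

def best_threshold(attn_maps, ref_masks, candidates):
    """Return the candidate gamma with the highest mean IoU over all images."""
    def mean_iou(gamma):
        scores = [iou(binarize(a, gamma), m)
                  for a, m in zip(attn_maps, ref_masks)]
        return sum(scores) / len(scores)
    return max(candidates, key=mean_iou)

candidates = [0.3, 0.5, 0.7]
attn = [[0.9, 0.6],
        [0.4, 0.1]]

# Two toy "classes" share the same attention map but have different
# reference masks, so their best thresholds differ.
ref_class_a = [[1, 1],
               [0, 0]]
ref_class_b = [[1, 0],
               [0, 0]]

gamma_a = best_threshold([attn], [ref_class_a], candidates)
gamma_b = best_threshold([attn], [ref_class_b], candidates)
print(gamma_a, gamma_b)
```

The two toy classes end up with different thresholds, which mirrors the caption's observation that a single global threshold is usually not optimal for every class.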