MaskDiffusion: Exploiting Pre-trained Diffusion Models for Semantic Segmentation

Yasufumi Kawano, Yoshimitsu Aoki

TL;DR

The proposed MaskDiffusion is an innovative approach that leverages pretrained frozen Stable Diffusion to achieve open-vocabulary semantic segmentation without the need for additional training or annotation, leading to improved performance compared to similar methods.

Abstract

Semantic segmentation is essential in computer vision for various applications, yet traditional approaches face significant challenges, including the high cost of annotation and extensive training for supervised learning. Additionally, due to the limited predefined categories in supervised learning, models typically struggle with infrequent classes and are unable to predict novel classes. To address these limitations, we propose MaskDiffusion, an innovative approach that leverages pretrained frozen Stable Diffusion to achieve open-vocabulary semantic segmentation without the need for additional training or annotation, leading to improved performance compared to similar methods. We also demonstrate the superior performance of MaskDiffusion in handling open vocabularies, including fine-grained and proper noun-based categories, thus expanding the scope of segmentation applications. Overall, our MaskDiffusion shows significant qualitative and quantitative improvements over other comparable unsupervised segmentation methods, e.g., on the Potsdam dataset (+10.5 mIoU compared to GEM) and COCO-Stuff (+14.8 mIoU compared to DiffSeg). All code and data will be released at https://github.com/Valkyrja3607/MaskDiffusion.

Paper Structure (17 sections, 7 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: Comparison of MaskDiffusion with the previous method MaskCLIP on a Cityscapes image. k-means clustering on the internal features of the diffusion U-Net (a) shows that the resulting clusters roughly partition the image by class, indicating that semantic information is well preserved. MaskDiffusion (b) yields well-partitioned segments consistent with the shape of the object and exhibits minimal noise. In comparison, MaskCLIP (c) results in smaller and noisy segments.
  • Figure 2: High-level overview of our MaskDiffusion architecture. MaskDiffusion uses a frozen pre-trained diffusion model. The U-Net is given latent images compressed by a VAE as well as text prompts embedded by CLIP. The prompts are the names of all the potential classes to be segmented. From the output of each U-Net layer, a concatenated internal feature $\mathbf{f}$ and a cross-attention map are extracted, which are subsequently post-processed into a segmentation image.
  • Figure 3: Overview of the post-processing step. First, a representative $\mathbf{f}$ is computed for each category as a weighted average of $\mathbf{f}$, with the weights given by the values of the cross-attention map. Next, the semantic segmentation result is determined by evaluating the cosine similarity between each $\mathbf{f}$ and the representative $\mathbf{f}$ of each category, and assigning each $\mathbf{f}$ to the category with the highest similarity.
  • Figure 4: Overview of the Unsupervised MaskDiffusion architecture. We employ spectral clustering to include the spatial relationships of the internal features in the segmentation process.
  • Figure 5: Qualitative results. Images (a) to (c) depict scenes from PascalVOC, images (d) and (e) represent scenes from Cityscapes, and image (f) shows a scene from Potsdam. We compare MaskCLIP, GEM, and our proposed MaskDiffusion. MaskCLIP and GEM provide fragmented and noisy segmentations, whereas MaskDiffusion exhibits a more cohesive and accurate segmentation for each object.
  • ...and 3 more figures
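
The post-processing step described in the Figure 3 caption (attention-weighted representative features, then cosine-similarity assignment) can be illustrated with a small sketch. This is not the authors' implementation: the function names `cosine`, `representatives`, and `segment` are hypothetical, the pixel features are assumed to be flattened into a list of vectors, the cross-attention maps are assumed to be non-negative weights with one list per category, and plain Python lists stand in for the tensor operations a real pipeline would use.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def representatives(features, attention):
    """Compute one attention-weighted mean feature per category.

    features:  list of N pixel feature vectors, each of length D (assumed layout)
    attention: list of K lists; attention[k][i] is the cross-attention weight
               of category k at pixel i (assumed non-negative)
    """
    dim = len(features[0])
    reps = []
    for weights in attention:
        total = sum(weights)
        rep = [0.0] * dim
        for w, f in zip(weights, features):
            for d in range(dim):
                rep[d] += w * f[d]
        # Normalise by the total weight so the result is a weighted average.
        reps.append([x / total for x in rep] if total > 0 else rep)
    return reps


def segment(features, attention):
    """Assign each pixel to the category whose representative is most similar."""
    reps = representatives(features, attention)
    labels = []
    for f in features:
        sims = [cosine(f, r) for r in reps]
        labels.append(max(range(len(reps)), key=sims.__getitem__))
    return labels


# Toy input: 4 pixels with 2-D features and 2 categories (made-up values).
features = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.8]]
attention = [
    [0.9, 0.8, 0.1, 0.0],  # category 0 attends mostly to pixels 0 and 1
    [0.1, 0.2, 0.9, 1.0],  # category 1 attends mostly to pixels 2 and 3
]
labels = segment(features, attention)
print(labels)  # [0, 0, 1, 1]
```

In this toy case the attention maps are noisy (category 0 has a small weight on pixel 2), yet the final labels follow the feature geometry, because each pixel is assigned by similarity to the representative rather than by its raw attention value.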