EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models

Koichi Namekata; Amirmojtaba Sabour; Sanja Fidler; Seung Wook Kim

TL;DR

This paper investigates whether pre-trained diffusion models encode fine-grained pixel-level semantic knowledge by introducing EmerDiff, an unsupervised image segmentor built solely from Stable Diffusion signals. It constructs low-resolution segmentation maps from semantically meaningful 16×16 cross-attention features and links them to high-resolution pixels through modulated denoising, enabling pixel-level labeling without annotations. Experiments across COCO-Stuff, ADE20K, PASCAL-Context, and Cityscapes demonstrate well-delineated, fine-grained segmentation maps and competitive performance with open-vocabulary and unsupervised baselines, including improvements when fused with text-aware segmentation methods. The work suggests that diffusion models contain rich pixel-level semantic knowledge and opens avenues for annotation-free segmentation and broader discriminative tasks using generative priors.

Abstract

Diffusion models have recently received increasing research attention for their remarkable transfer abilities in semantic segmentation tasks. However, generating fine-grained segmentation masks with diffusion models often requires additional training on annotated datasets, leaving it unclear to what extent pre-trained diffusion models alone understand the semantic relations of their generated images. To address this question, we leverage the semantic knowledge extracted from Stable Diffusion (SD) and aim to develop an image segmentor capable of generating fine-grained segmentation maps without any additional training. The primary difficulty stems from the fact that semantically meaningful feature maps typically exist only in the spatially lower-dimensional layers, which poses a challenge in directly extracting pixel-level semantic relations from these feature maps. To overcome this issue, our framework identifies semantic correspondences between image pixels and spatial locations of low-dimensional feature maps by exploiting SD's generation process and utilizes them for constructing image-resolution segmentation maps. In extensive experiments, the produced segmentation maps are demonstrated to be well delineated and capture detailed parts of the images, indicating the existence of highly accurate pixel-level semantic knowledge in diffusion models.
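As a rough illustration of the first stage, the sketch below clusters a low-resolution feature map into k masks with plain k-means, as the framework overview in Figure 2 describes. It is a minimal NumPy sketch, not the authors' implementation: the function name `kmeans_low_res_masks`, the farthest-point initialisation, and the fixed iteration count are assumptions made for this example, and the feature map is taken as an already-extracted array of shape (h, w, c).

```python
import numpy as np


def kmeans_low_res_masks(features, k, n_iters=20):
    """Cluster an (h, w, c) feature map into k low-resolution masks.

    Returns an (h, w) integer label map with values in [0, k).
    Hypothetical helper for illustration only.
    """
    h, w, c = features.shape
    x = features.reshape(h * w, c).astype(np.float64)

    # Farthest-point initialisation (an assumption of this sketch):
    # start from the first location, then repeatedly add the location
    # farthest from all centroids chosen so far.
    chosen = [x[0]]
    for _ in range(1, k):
        d = np.min([((x - m) ** 2).sum(axis=1) for m in chosen], axis=0)
        chosen.append(x[d.argmax()])
    centroids = np.stack(chosen)

    for _ in range(n_iters):
        # Squared Euclidean distance from every location to every centroid.
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)

    return labels.reshape(h, w)
```

Each distinct label in the returned map corresponds to one binary low-resolution mask, for example `labels == 0`.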

Paper Structure (16 sections, 18 figures, 9 tables)

Introduction
Related works
Methods
Preliminaries
Constructing low-resolution segmentation maps
Building image-resolution segmentation maps
Experiments
Implementation details
Qualitative analysis
Quantitative analysis
Limitation and Conclusion
Additional Implementation Details
Additional details of Unsupervised Semantic Segmentation
Additional details of open-vocabulary semantic segmentation
Hyperparameter Analysis
...and 1 more section

Figures (18)

Figure 1: EmerDiff is an unsupervised image segmentor solely built on the semantic knowledge extracted from a pre-trained diffusion model. The obtained fine-detailed segmentation maps suggest the presence of highly accurate pixel-level semantic knowledge in diffusion models.
Figure 2: Overview of our framework. Green: we first construct low-resolution segmentation maps by applying k-means on semantically meaningful low-dimensional feature maps. Orange: next, we generate image-resolution segmentation maps by mapping each pixel to the most semantically corresponding low-resolution mask, where semantic correspondences are identified by the modulated denoising process.
Figure 3: Visualization of modulated denoising process. First row: original image. Second row: low-resolution modulation mask $M \in \{0, 1\}^{h\times w}$. Third row: obtained difference map $d \in \mathbb{R}^{H\times W}$, where $H/h=W/w=32$.
Figure 4: Qualitative comparison with naively upsampled low-resolution segmentation maps.
Figure 5: Varying the number of segmentation masks. Our framework consistently groups objects in a semantically meaningful manner.
...and 13 more figures
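
The captions of Figures 2 to 4 suggest how the second stage could look in code: each low-resolution mask yields one difference map $d \in \mathbb{R}^{H\times W}$, and each pixel is mapped to the mask it corresponds to most strongly. The sketch below assumes the K difference maps have already been computed by the modulated denoising process and are stacked into a (K, H, W) array, and it assumes that "most semantically corresponding" means the largest difference value. The function names `assign_pixels` and `naive_upsample` are hypothetical; the second is the naive baseline that Figure 4 compares against.

```python
import numpy as np


def assign_pixels(diff_maps):
    """Map every pixel to the mask with the largest difference value.

    diff_maps: (K, H, W) array, one difference map per low-resolution mask
    (assumed to be precomputed). Returns an (H, W) integer label map.
    """
    return np.argmax(diff_maps, axis=0)


def naive_upsample(low_res_labels, factor=32):
    """Nearest-neighbour upsampling of an (h, w) label map, for comparison."""
    rows = np.repeat(low_res_labels, factor, axis=0)
    return np.repeat(rows, factor, axis=1)
```

With $H/h = W/w = 32$, `naive_upsample` turns every low-resolution cell into a 32×32 block of identical labels, whereas `assign_pixels` can place mask boundaries at individual pixels.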
