Unconditional CNN denoisers contain sparse semantic representation of images
Zahra Kadkhodaie, Stéphane Mallat, Eero Simoncelli
TL;DR
This work probes the internal representations of unconditional diffusion denoisers, showing that a fully convolutional UNet learns a sparse, semantically meaningful representation in its middle block, captured by $\phi(x_\sigma)=\mathbb{E}_{z}[\bar a_4(x+\sigma z)]$. The representation lies in a union of subspaces with two channel types (selective and non-selective), and distances in this space correlate with semantic similarity, enabling unsupervised clustering that reflects scene gist rather than object labels. A novel self-guided stochastic reconstruction algorithm samples from $p(x|\phi)$ by alternating score-based denoising with a gradient projection to match $\phi$; the resulting conditional samples reveal both common structure and diversity encoded by the representation. These findings illuminate how high-level semantic information can emerge purely from a denoising objective, with potential implications for understanding diffusion models and guiding conditional generation without explicit labels.
Abstract
Generative diffusion models learn probability densities over diverse image datasets by estimating the score with a neural network trained to remove noise. Despite their remarkable success in generating high-quality images, the internal mechanisms of the underlying score networks are not well understood. Here, we examine the image representation that arises from score estimation in a {fully-convolutional unconditional UNet}. We show that the middle block of the UNet decomposes individual images into sparse subsets of active channels, and that the vector of spatial averages of these channels can provide a nonlinear representation of the underlying clean images. Euclidean distances in this representation space are semantically meaningful, even though no conditioning information is provided during training. We develop a novel algorithm for stochastic reconstruction of images conditioned on this representation: The synthesis using the unconditional model is "self-guided" by the representation extracted from that very same model. For a given representation, the common patterns in the set of reconstructed samples reveal the features captured in the middle block of the UNet. Together, these results show, for the first time, that a measure of semantic similarity emerges, unsupervised, solely from the denoising objective.
