
Unconditional CNN denoisers contain sparse semantic representation of images

Zahra Kadkhodaie, Stéphane Mallat, Eero Simoncelli

TL;DR

This work probes the internal representations of unconditional diffusion denoisers, showing that a fully convolutional UNet learns a sparse, semantically meaningful representation in its middle block, captured by $\phi(x_\sigma)=\mathbb{E}_{z}[\bar a_4(x+\sigma z)]$. The representation lies in a union of subspaces with two channel types (selective and non-selective), and distances in this space correlate with semantic similarity, enabling unsupervised clustering that reflects scene gist rather than object labels. A novel self-guided stochastic reconstruction algorithm samples from $p(x|\phi)$ by alternating score-based denoising with a gradient projection to match $\phi$; the resulting conditional samples reveal both common structure and diversity encoded by the representation. These findings illuminate how high-level semantic information can emerge purely from a denoising objective, with potential implications for understanding diffusion models and guiding conditional generation without explicit labels.

Abstract

Generative diffusion models learn probability densities over diverse image datasets by estimating the score with a neural network trained to remove noise. Despite their remarkable success in generating high-quality images, the internal mechanisms of the underlying score networks are not well understood. Here, we examine the image representation that arises from score estimation in a fully-convolutional unconditional UNet. We show that the middle block of the UNet decomposes individual images into sparse subsets of active channels, and that the vector of spatial averages of these channels can provide a nonlinear representation of the underlying clean images. Euclidean distances in this representation space are semantically meaningful, even though no conditioning information is provided during training. We develop a novel algorithm for stochastic reconstruction of images conditioned on this representation: The synthesis using the unconditional model is "self-guided" by the representation extracted from that very same model. For a given representation, the common patterns in the set of reconstructed samples reveal the features captured in the middle block of the UNet. Together, these results show, for the first time, that a measure of semantic similarity emerges, unsupervised, solely from the denoising objective.
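
To make the representation concrete, the following is a minimal sketch of a Monte Carlo estimate of $\phi(x_\sigma)=\mathbb{E}_{z}[\bar a_4(x+\sigma z)]$: add Gaussian noise to an image, take the spatial average of each middle-block channel, and average the resulting vectors over noise draws. The names `estimate_phi`, `spatial_average`, and `toy_middle_block` are hypothetical, and the toy block is only a stand-in so that the example runs; it is not the trained UNet studied in the paper, and the number of noise samples is an arbitrary choice.

```python
import random
from statistics import fmean
from typing import Callable, List

Image = List[List[float]]               # H x W grayscale image
Activations = List[List[List[float]]]   # C channel maps, each H' x W'


def spatial_average(acts: Activations) -> List[float]:
    """Average every channel map over its spatial positions (one value per channel)."""
    return [fmean([v for row in channel for v in row]) for channel in acts]


def estimate_phi(
    x: Image,
    sigma: float,
    middle_block: Callable[[Image], Activations],
    n_samples: int = 32,
    seed: int = 0,
) -> List[float]:
    """Monte Carlo estimate of E_z[a_bar(x + sigma * z)] with z ~ N(0, I)."""
    rng = random.Random(seed)
    total: List[float] = []
    for _ in range(n_samples):
        # Draw a fresh noise realization and perturb the clean image.
        noisy = [[v + sigma * rng.gauss(0.0, 1.0) for v in row] for row in x]
        a_bar = spatial_average(middle_block(noisy))
        total = a_bar if not total else [t + a for t, a in zip(total, a_bar)]
    return [t / n_samples for t in total]


def toy_middle_block(img: Image) -> Activations:
    """Hypothetical stand-in for the UNet middle block: two rectified channels."""
    positive = [[max(v, 0.0) for v in row] for row in img]
    negative = [[max(-v, 0.0) for v in row] for row in img]
    return [positive, negative]


if __name__ == "__main__":
    image = [[1.0, 1.0], [1.0, 1.0]]
    print(estimate_phi(image, sigma=0.5, middle_block=toy_middle_block))
```

With a real model, `middle_block` would be replaced by a forward pass that returns the middle-block activations (for example, captured with a hook); how the paper extracts them is not specified in this summary.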

Paper Structure

This paper contains 19 sections, 7 equations, 21 figures, 2 algorithms.

Figures (21)

  • Figure 1: Channel sparsity of input and output layers of a UNet trained on ImageNet. Histograms show participation ratios (PR) of the spatially averaged input channels (orange) and output channels (green) for individual blocks. The middle block and decoder blocks exhibit increases in sparsity (i.e., reduction in PR). Blocks $\{E_1, D_1\}$ are not included since they have only one input/output channel, respectively. This is evidence that encoder blocks extract features to isolate noise and signal, and that the middle block and decoder blocks preserve those channels containing signal while suppressing those containing noise. (Notation: encoder blocks $\{E_k\}$, middle block ($M$), decoder blocks $\{D_k\}$, downsampling ($d$), upsampling ($u$), "skip" connections.) See \ref{fig:representation-sparsity-texture} for other models.
  • Figure 2: Stability of $\bar{a}$ across noise levels, for different network blocks of a model trained on ImageNet64. Plots show cosine similarity of $\bar{a}(x_{\sigma_1})$ and $\bar{a}(x_{\sigma_2})$, for $\sigma_1 = 0.5$, as a function of $\sigma_2$. $\bar{a}$ is most stable in the middle block (M). Note that $\bar{a}$ collapses as $\sigma$ falls to zero, for which the denoiser should compute the identity function. See \ref{fig:noise-level-dependency-unet-texture} for other models.
  • Figure 3: Channel selectivity. Left: Participation ratios for each channel over ImageNet. The distribution is bimodal, corresponding to channels that are highly specialized (and infrequently active) on the left, and commonly used channels on the right. Right: The panels show the set of images that maximally activate each of four specialized channels, revealing selectivity for rectangular periodic lattices (PR$=0.19$), a bird on a branch ($0.18$), cylindrical objects ($0.19$), and dog faces ($0.26$). See \ref{fig:channel-selectivity-texture} for other models.
  • Figure 4: Specialized channels capture visual attributes and composition of an image. Left: Example image that activates several specialized channels. Right: Each panel shows the set of images that maximally activate one of the specialized channels activated by the example image, corresponding to images of people, periodic texture patterns, and images with left-right reflective symmetry. All three elements are present in the example image.
  • Figure 5: Union of subspaces. Left: Two sets of images whose $\phi$'s lie on two subspaces. Middle/Right: Three components of the $\phi$ vectors (out of the 512) for these images. The vertical axis corresponds to a common channel, while the other two correspond to specialized channels, each selective for only one image cluster. As a result, the $\phi$ vectors lie on a union of two-dimensional subspaces in the displayed three-dimensional ambient space.
  • ...and 16 more figures
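
The captions for Figures 1 to 3 rely on two quantities: the participation ratio (PR) as a sparsity measure and the cosine similarity between channel-average vectors at two noise levels. The captions do not state the PR formula, so the sketch below assumes one common normalized form, $\mathrm{PR}(a) = \|a\|_1^2 / (N\,\|a\|_2^2)$, which equals $1/N$ when a single entry is active and $1$ when all entries are equally active. This choice is an assumption and may differ from the definition used in the paper. The function names are hypothetical.

```python
import math
from typing import Sequence


def participation_ratio(a: Sequence[float]) -> float:
    """Assumed normalized PR: (sum |a_i|)^2 / (N * sum a_i^2), in (0, 1].

    Lower values mean sparser vectors. This formula is an assumption;
    the figure captions do not define PR explicitly.
    """
    n = len(a)
    l1 = sum(abs(v) for v in a)
    l2_squared = sum(v * v for v in a)
    if n == 0 or l2_squared == 0.0:
        raise ValueError("participation ratio is undefined for an empty or all-zero vector")
    return (l1 * l1) / (n * l2_squared)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(p * q for p, q in zip(u, v))
    norm_u = math.sqrt(sum(p * p for p in u))
    norm_v = math.sqrt(sum(q * q for q in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    sparse = [0.0, 0.0, 3.0, 0.0]   # one active channel out of four
    dense = [1.0, 1.0, 1.0, 1.0]    # all channels equally active
    print(participation_ratio(sparse), participation_ratio(dense))
    print(cosine_similarity(sparse, dense))
```

Under this assumed definition, a stability curve like the one in Figure 2 would be obtained by evaluating `cosine_similarity` on the channel-average vectors computed at a fixed $\sigma_1$ and a range of $\sigma_2$ values.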