Locality in Image Diffusion Models Emerges from Data Statistics

Artem Lukoianov, Chenyang Yuan, Justin Solomon, Vincent Sitzmann

TL;DR

This work argues that locality in image diffusion models is primarily determined by data statistics rather than neural network architecture. By linking denoising sensitivity to the data covariance, the authors show that the Wiener filter, which acts as a projection onto the high-SNR principal components of the data, closely matches the sensitivity fields learned by diverse architectures, and that these sensitivity patterns can be nonlocal for datasets with nonstandard covariance (e.g., centered faces). They derive an analytical diffusion model that leverages high-SNR components and a data-driven masking approach, which outperforms prior patch-based methods across several datasets. The findings imply that improving generation quality hinges on capturing dataset statistics, and that controlled manipulation of data statistics can shape locality patterns and generalization in diffusion models.

Abstract

Recent work has shown that the generalization ability of image diffusion models arises from the locality properties of the trained neural network. In particular, when denoising a particular pixel, the model relies on a limited neighborhood of the input image around that pixel, which, according to the previous work, is tightly related to the ability of these models to produce novel images. Since locality is central to generalization, it is crucial to understand why diffusion models learn local behavior in the first place, as well as the factors that govern the properties of locality patterns. In this work, we present evidence that the locality in deep diffusion models emerges as a statistical property of the image dataset and is not due to the inductive bias of convolutional neural networks, as suggested in previous work. Specifically, we demonstrate that an optimal parametric linear denoiser exhibits similar locality properties to deep neural denoisers. We show, both theoretically and experimentally, that this locality arises directly from pixel correlations present in the image datasets. Moreover, locality patterns are drastically different on specialized datasets, approximating principal components of the data's covariance. We use these insights to craft an analytical denoiser that better matches scores predicted by a deep diffusion model than prior expert-crafted alternatives. Our key takeaway is that while neural network architectures influence generation quality, their primary role is to capture locality patterns inherent in the data.
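As a rough illustration of the optimal linear denoiser mentioned above, the following NumPy sketch builds a Wiener filter from the empirical covariance of a dataset and reads off a sensitivity field as one row of the filter matrix. It assumes a simple noise model $x_t = x_0 + \sigma_t \epsilon$ and flattened images; the function names (`wiener_filter_matrix`, `wiener_denoise`) and the synthetic smoothed-noise data are illustrative assumptions, not the authors' code.

```python
import numpy as np


def wiener_filter_matrix(data, sigma):
    """Return (mean, W) for the optimal affine denoiser mean + W (x - mean).

    data  : (N, D) array of flattened clean samples.
    sigma : noise standard deviation, assuming x_noisy = x_clean + sigma * eps.
    """
    mean = data.mean(axis=0)
    centered = data - mean
    cov = centered.T @ centered / data.shape[0]   # empirical covariance, (D, D)
    eigvals, eigvecs = np.linalg.eigh(cov)        # cov = U diag(eigvals) U^T
    eigvals = np.clip(eigvals, 0.0, None)         # guard against tiny negative values
    gain = eigvals / (eigvals + sigma ** 2)       # per-component shrinkage in [0, 1)
    W = (eigvecs * gain) @ eigvecs.T              # U diag(gain) U^T
    return mean, W


def wiener_denoise(x_noisy, mean, W):
    """Apply the affine denoiser to one sample (D,) or a batch (N, D)."""
    return mean + (x_noisy - mean) @ W.T


# Toy data (an assumption for illustration): smoothed white noise, so that
# neighbouring "pixels" are correlated, loosely mimicking natural images.
rng = np.random.default_rng(0)
kernel = np.array([0.25, 0.5, 0.25])
white = rng.standard_normal((2000, 16))
data = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, white)

mean, W = wiener_filter_matrix(data, sigma=0.5)
sensitivity_of_pixel_8 = W[8]   # Jacobian row: d(output pixel 8) / d(input pixels)
```

Components whose variance is large relative to $\sigma_t^2$ get a gain near 1 and the rest a gain near 0, which is why the filter behaves like a projection onto high-SNR principal components. For a linear denoiser the Jacobian does not depend on the input, so row `i` of `W` is the sensitivity field of output pixel `i`; on correlated toy data like this it concentrates around pixel `i` even though nothing in the code is convolutional.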

Paper Structure

This paper contains 46 sections, 5 theorems, 47 equations, 20 figures, 7 tables, 1 algorithm.

Key Result

Proposition A.2

When $X = \left\{ x_0^{i} \right\}_{i \in [N]}$ is a finite empirical distribution, the optimal denoiser $\hat{f}(x, t)$ has the following analytical expression:
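For orientation, a commonly used closed form for the optimal denoiser on a finite training set is a softmax-weighted average of the training points, $\hat{f}(x, t) = \sum_i w_i(x, t)\, x_0^{i}$ with $w_i \propto \exp\left(-\|x - a_t x_0^{i}\|^2 / (2\sigma_t^2)\right)$. The sketch below implements that standard form under an assumed forward process $x_t = a_t x_0 + \sigma_t \epsilon$; the symbols $a_t$, $\sigma_t$ and the function name `empirical_optimal_denoiser` are assumptions and may not match the paper's notation.

```python
import numpy as np


def empirical_optimal_denoiser(x, train, scale, sigma):
    """Posterior mean of x_0 given a noisy x, for a finite training set.

    x     : (D,) noisy sample.
    train : (N, D) training points x_0^i.
    scale : signal scale a_t (assumed forward process x_t = a_t x_0 + sigma_t eps).
    sigma : noise standard deviation sigma_t (must be > 0).
    """
    sq_dist = np.sum((x - scale * train) ** 2, axis=1)   # (N,) distances to scaled points
    logits = -sq_dist / (2.0 * sigma ** 2)
    logits = logits - logits.max()                       # stabilise the softmax
    weights = np.exp(logits)
    weights = weights / weights.sum()
    return weights @ train                               # weighted average of training points


# Two hypothetical training points, purely for illustration.
train = np.array([[0.0, 0.0], [10.0, 10.0]])
low_noise = empirical_optimal_denoiser(np.array([0.1, -0.1]), train, scale=1.0, sigma=0.1)
high_noise = empirical_optimal_denoiser(np.array([5.0, 5.0]), train, scale=1.0, sigma=1e3)
```

At low noise the weights collapse onto the nearest training point, which corresponds to the "teleporting" behavior described for $f^*$ in Figure 1; at high noise the output approaches the mean of the training set.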

Figures (20)

  • Figure 1: Left: We visualize the distribution of $x_t$ for two training data points $x_0^{(1)}$ and $x_0^{(2)}$ as high-probability-density "cones", as a function of spatial dimension $x$ and noise level $t$. Note how for a new testing point $x_0^{(\text{test})}$ there exists a noise level $t'$ such that the noised versions $x_{t'}^{(\text{test})}$ lie outside all of the training "cones", and thus the behavior of the denoiser there is undefined. Middle: We take CIFAR10 test images (top) and add noise $\epsilon_{t'}$ (2nd row). With a single denoising step, a trained diffusion model $f_\theta$ "passes through" most of the coarse structure of the input image, and thus the output image is visually similar to the input (3rd row). The optimal denoiser $f^*$ instead "teleports" the image to the closest data point in the training dataset (4th row). Right: We compare the MSE of single-step denoising with $f_\theta$ (U-Net) and $f^*$ (Optimal). At low noise levels, $f_\theta$ removes noise from $x_{t'}^{(\text{test})}$, but $f^*$ predicts an image different from $x_0^{(\text{test})}$. At high noise levels, the outputs of $f_\theta$ and $f^*$ are similar.
  • Figure 2: Comparison of the sensitivity fields of deep denoisers and the projection operators onto the high-SNR components of the data (i.e., the Wiener filter) on the CIFAR-10 dataset. Sensitivity is measured at the center pixel w.r.t. the $x_0$ prediction and throughout a 1000-step DDIM denoising process. Each image is averaged across $32$ samples and normalized to $[0,1]$.
  • Figure 3: Average sensitivity fields of a trained DDPM on the CelebA-HQ dataset. The top row corresponds to an output pixel located near the left eye; the bottom row corresponds to an output pixel near the image center. Left to right: different noise levels corresponding to $t$ of $600$, $400$, $200$.
  • Figure 4: We slightly manipulate pixel correlations across the CIFAR-10 dataset such that a desired pattern emerges in the sensitivity of a trained diffusion model. In particular, a DDPM diffusion model trained on the CIFAR-10 dataset (sample on the top left) has a coarse-to-fine sensitivity field (top row, noise level decreases from left to right). For each image in the dataset, we edit pixel correlations by adding the desired pattern with random color and weights $\gamma = 0.1$ (middle row) and $\gamma = 0.5$ (bottom row). DDPM models trained on those manipulated datasets exhibit the pattern in their sensitivity fields. We underscore the time-steps for which $\mathrm{SNR}_W > 0.1$, i.e., $\lambda^2_W > 0.1\sigma^2_t$. This supports our claim that the locality in diffusion models arises not from the inductive bias (i.e., the use of convolutional layers) but from the data statistics.
  • Figure 5: Qualitative comparison. We compare our analytical model (3rd row) with multiple baselines: the Wiener filter (4th row) and the analytical model of Kamb and Ganguli [kamb2024analytic] (5th row). All images are generated from the same initial noise sample with 10 steps of DDIM [song2020denoising]. In the top row, we provide the results of generation with two trained neural networks, NN1 and NN2; both are instances of the same DDPM U-Net [ho2020denoising], but trained with different seeds. The distance in the metrics comparison table (tab:metrics-comparison) is measured with respect to NN1. In the last row, we provide the dataset image nearest to our final generation w.r.t. L2 distance.
  • ...and 15 more figures
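
The sensitivity fields in the captions above can be read as derivatives of one output pixel with respect to every input pixel. A minimal way to estimate such a field for a black-box denoiser is central finite differences, sketched below. The function names, the use of absolute values before averaging, and the min-max normalization are assumptions made for illustration; in practice automatic differentiation would typically be used instead.

```python
import numpy as np


def sensitivity_field(denoiser, x, out_index, eps=1e-3):
    """Estimate d(denoiser(x)[out_index]) / d(x[j]) for every input pixel j.

    denoiser  : callable mapping an array to an array of the same shape.
    x         : input image (any shape), treated as a flat list of pixels.
    out_index : flat index of the output pixel whose sensitivity is measured.
    """
    x = np.asarray(x, dtype=float)
    field = np.zeros_like(x)
    for j in range(x.size):
        step = np.zeros_like(x)
        step.flat[j] = eps
        plus = np.asarray(denoiser(x + step)).flat[out_index]
        minus = np.asarray(denoiser(x - step)).flat[out_index]
        field.flat[j] = (plus - minus) / (2.0 * eps)     # central difference
    return field


def average_normalized_field(denoiser, samples, out_index):
    """Average |sensitivity| over several inputs and rescale to [0, 1]."""
    mean_abs = np.mean(
        [np.abs(sensitivity_field(denoiser, s, out_index)) for s in samples], axis=0
    )
    lo, hi = mean_abs.min(), mean_abs.max()
    return (mean_abs - lo) / (hi - lo) if hi > lo else np.zeros_like(mean_abs)
```

For a linear denoiser `lambda v: A @ v`, `sensitivity_field` recovers row `out_index` of `A` up to floating-point error, which gives a simple sanity check before applying it to a trained network.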

Theorems & Definitions (16)

  • Definition A.1
  • Proposition A.2
  • proof
  • Definition A.3
  • Proposition A.4
  • proof
  • Remark A.5
  • Definition A.6
  • Proposition A.7: Patch-based optimal denoiser
  • proof: Proof of Patch-based optimal denoiser
  • ...and 6 more