Locality in Image Diffusion Models Emerges from Data Statistics

Artem Lukoianov; Chenyang Yuan; Justin Solomon; Vincent Sitzmann

TL;DR

This work argues that locality in image diffusion models is primarily determined by data statistics rather than by neural network architecture. By linking denoising sensitivity to the data covariance, the authors show that the Wiener filter, which acts as a projection onto the high-SNR principal components of the data, closely matches the sensitivity fields learned by diverse architectures, and that these sensitivity fields can become nonlocal for datasets with nonstandard covariance (e.g., centered faces). They derive an analytical diffusion model that combines the high-SNR components with a data-driven masking approach and that matches the scores of a trained deep diffusion model more closely than prior patch-based methods across several datasets. The findings imply that improving generation quality hinges on capturing dataset statistics, and that controlled manipulation of those statistics can shape locality patterns and generalization in diffusion models.
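For reference, the link between sensitivity and covariance can be written out with the textbook Wiener filter. This is a sketch under an assumed noise model $y = x + \sigma\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$; the symbols $\mu$, $\Sigma$, $\lambda_i$, $u_i$, and $W_\sigma$ are our notation and may differ from the paper's parametrization.

```latex
% Data mean \mu, data covariance \Sigma = \sum_i \lambda_i u_i u_i^\top (eigendecomposition).
% Optimal affine (minimum mean-squared-error) denoiser of y = x + \sigma \varepsilon:
\hat{x}(y) = \mu + W_\sigma\,(y - \mu),
\qquad
W_\sigma = \Sigma\,(\Sigma + \sigma^2 I)^{-1}
         = \sum_i \frac{\lambda_i}{\lambda_i + \sigma^2}\, u_i u_i^\top .
% Sensitivity of output pixel p to the input is row p of the Jacobian:
\frac{\partial \hat{x}_p}{\partial y} = (W_\sigma)_{p,:}
```

Components with $\lambda_i \gg \sigma^2$ (high SNR) pass through with gain close to 1, while components with $\lambda_i \ll \sigma^2$ are suppressed, so the shape of each row of $W_\sigma$ is set entirely by the covariance of the data.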

Abstract

Recent work has shown that the generalization ability of image diffusion models arises from the locality properties of the trained neural network. In particular, when denoising a particular pixel, the model relies on a limited neighborhood of the input image around that pixel, which, according to the previous work, is tightly related to the ability of these models to produce novel images. Since locality is central to generalization, it is crucial to understand why diffusion models learn local behavior in the first place, as well as the factors that govern the properties of locality patterns. In this work, we present evidence that the locality in deep diffusion models emerges as a statistical property of the image dataset and is not due to the inductive bias of convolutional neural networks, as suggested in previous work. Specifically, we demonstrate that an optimal parametric linear denoiser exhibits similar locality properties to deep neural denoisers. We show, both theoretically and experimentally, that this locality arises directly from pixel correlations present in the image datasets. Moreover, locality patterns are drastically different on specialized datasets, approximating principal components of the data's covariance. We use these insights to craft an analytical denoiser that better matches scores predicted by a deep diffusion model than prior expert-crafted alternatives. Our key takeaway is that while neural network architectures influence generation quality, their primary role is to capture locality patterns inherent in the data.
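The claim that a linear denoiser inherits locality from pixel correlations can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes a toy dataset of 1-D "images" whose neighboring pixels are correlated (an AR(1) process), an additive noise model `y = x + sigma * noise`, and hypothetical helper names (`make_toy_signals`, `wiener_filter`) introduced only for this example.

```python
import numpy as np


def make_toy_signals(n_samples=20000, n_pixels=32, rho=0.9, seed=0):
    """Toy data (assumption): AR(1) signals, so nearby pixels are correlated."""
    rng = np.random.default_rng(seed)
    x = np.empty((n_samples, n_pixels))
    x[:, 0] = rng.standard_normal(n_samples)
    for j in range(1, n_pixels):
        innovation = rng.standard_normal(n_samples)
        x[:, j] = rho * x[:, j - 1] + np.sqrt(1.0 - rho**2) * innovation
    return x


def wiener_filter(data, sigma):
    """Optimal affine denoiser for y = x + sigma * noise.

    Returns the data mean and the matrix W = Sigma (Sigma + sigma^2 I)^{-1},
    built from the eigendecomposition of the empirical covariance.
    """
    mean = data.mean(axis=0)
    centered = data - mean
    cov = centered.T @ centered / len(data)
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.clip(eigvals, 0.0, None)      # guard against tiny negatives
    gains = eigvals / (eigvals + sigma**2)     # per-component shrinkage
    W = (eigvecs * gains) @ eigvecs.T
    return mean, W


data = make_toy_signals()
mean, W = wiener_filter(data, sigma=0.5)

# Row `center` of W is the sensitivity of the denoised center pixel
# to every input pixel (the Jacobian of a linear map is the map itself).
center = data.shape[1] // 2
sensitivity = np.abs(W[center])
local_mass = sensitivity[center - 3:center + 4].sum() / sensitivity.sum()
print("most influential input pixel:", int(np.argmax(sensitivity)))
print("fraction of sensitivity within 3 pixels:", round(float(local_mass), 3))
```

No convolution or other architectural prior appears anywhere in this sketch; the concentration of the sensitivity row around the center pixel comes only from the covariance of the toy data. Replacing `make_toy_signals` with data that has long-range correlations would be expected to change the shape of that row accordingly.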
