An analytic theory of creativity in convolutional diffusion models

Mason Kamb, Surya Ganguli

TL;DR

This work derives an analytic, interpretable theory of creativity in convolutional diffusion models by focusing on two fundamental inductive biases: locality and translational equivariance. The authors formulate three score machines, Equivariant Score (ES), Local Score (LS), and Equivariant Local Score (ELS), and, with boundary considerations, show how these lead to locally consistent patch mosaics that generate novel images beyond memorized training examples. They prove convergence properties and validate the theory with case-by-case predictions against trained CNN diffusion models (ResNet/UNet) on MNIST, FashionMNIST, CIFAR10, and CelebA, achieving high $r^2$ values (medians of 0.94 to 0.96 for the top model) and revealing the role of boundaries and attention in shaping outputs. The results illuminate a patch-mosaic mechanism of creativity, quantify the impact of locality, and provide a principled bridge to understanding attention-enabled diffusion models, with potential practical implications for interpretability and generation quality.

Abstract

We obtain an analytic, interpretable and predictive theory of creativity in convolutional diffusion models. Indeed, score-matching diffusion models can generate highly original images that lie far from their training data. However, optimal score-matching theory suggests that these models should only be able to produce memorized training examples. To reconcile this theory-experiment gap, we identify two simple inductive biases, locality and equivariance, that: (1) induce a form of combinatorial creativity by preventing optimal score-matching; (2) result in fully analytic, completely mechanistically interpretable local score (LS) and equivariant local score (ELS) machines that, (3) after calibrating a single time-dependent hyperparameter, can quantitatively predict the outputs of trained convolution-only diffusion models (like ResNets and UNets) with high accuracy (median $r^2$ of $0.95, 0.94, 0.94, 0.96$ for our top model on CIFAR10, FashionMNIST, MNIST, and CelebA). Our model reveals a locally consistent patch mosaic mechanism of creativity, in which diffusion models create exponentially many novel images by mixing and matching different local training set patches at different scales and image locations. Our theory also partially predicts the outputs of pre-trained self-attention-enabled UNets (median $r^2 \sim 0.77$ on CIFAR10), revealing an intriguing role for attention in carving out semantic coherence from local patch mosaics.
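The memorization claim for optimal score-matching can be made concrete with a minimal sketch. It assumes a variance-preserving forward process in which a training image $x_i$ is noised to $\mathcal{N}(\sqrt{\bar\alpha_t}\,x_i, (1-\bar\alpha_t) I)$; the name `ideal_score`, the parameter `alpha_bar`, and the flattened-image layout are illustrative choices, not the paper's notation. Under that assumption the noised empirical distribution is a Gaussian mixture, and its exact score is a softmax-weighted pull toward the scaled training images.

```python
import numpy as np

def ideal_score(phi, train, alpha_bar):
    """Exact score of the noised empirical distribution (illustrative sketch).

    phi:       (d,) current noisy image, flattened
    train:     (n, d) training images, flattened
    alpha_bar: signal level in (0, 1); assumed variance-preserving schedule
    """
    means = np.sqrt(alpha_bar) * train        # mixture component means, (n, d)
    var = 1.0 - alpha_bar                     # shared isotropic variance
    sq_dist = np.sum((phi - means) ** 2, axis=1)
    logits = -sq_dist / (2.0 * var)
    logits -= logits.max()                    # stabilize the softmax
    weights = np.exp(logits)
    weights /= weights.sum()                  # posterior over training images
    # Gradient of the log mixture density: weighted pull toward the means.
    return (weights @ means - phi) / var
```

As `alpha_bar` approaches 1 the softmax weights concentrate on the nearest training image, so `phi + (1 - alpha_bar) * ideal_score(...)` collapses onto a single training example. This is the behavior that the locality and equivariance constraints are said to prevent.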

Paper Structure

This paper contains 30 sections, 3 theorems, 49 equations, 24 figures, 1 table.

Key Result

Theorem 4.1

For the LS, ELS, and boundary broken ELS machines, assuming $\lim_{t \to 0} \phi_t$ and $\lim_{t \to 0} \partial_t\phi_t$ exist, then for every pixel $x$, $\phi_0(x) = \varphi(0)$ for the unique patch $\varphi \in P_{\Omega}^x(\mathcal{D})$ for which $\phi_{\Omega_x}$ is closer in $L_2$ distance (in

Figures (24)

  • Figure 1: Our analytic theory (left columns) can accurately predict, on a case-by-case basis, the outputs of convolutional diffusion models (right columns), with UNet or ResNet architectures trained on MNIST, CIFAR10, FashionMNIST, and CelebA (left to right), even when these outputs are highly original and far from the training data. See Fig. \ref{fig:enter-label}, App. \ref{app:empirics}, Fig. \ref{fig:a1} and Table \ref{tab:correlation_results}, and App. \ref{appendix:samples}, Fig. \ref{fig:resnet-mnist-zeros} to Fig. \ref{fig:cifar10-zeros-attention} for many more successful theory-experiment comparisons.
  • Figure 2: Ideal score-matching under various constraints. (a) In the ideal score (IS) machine, the entire image (bottom) flows in reverse to a single image from the training set (top stack). (b,c) In both the LS and ELS machines, different local patches of the image flow to different local patches in the training set. In the LS machine this final training patch must be drawn from the same location (b), while in the ELS machine it can be drawn from any location (c).
  • Figure 3: Exponential creativity through locally consistent patch mosaics. (a) A training set of two images (all black or all white). (b) Original samples from any local score machine (LS or ELS) with a $3\times3$ locality window and periodic boundary conditions. Local consistency in this special case means every generated pixel is either black or white, and the majority color of every generated $3\times3$ patch equals the color of its central pixel. (c) Samples are generated by numerically integrating the reverse flow in Eq. \ref{eq:reverseflow}. If the step size in this integration is too large, one can generate invalid samples with a few cases of broken local consistency (highlighted red patches). In trained diffusion models, this local consistency would hold only approximately.
  • Figure 4: Coarse to fine progression of spatial locality in the reverse flow. (a) A heatmap of the average absolute value of the Jacobian from the output score $M_t[\phi_t](x=0)$ at the center pixel $x=0$ back to all input pixels $\phi(x')$ as a function of $x'$. This receptive field shrinks from large to small as time progresses from early (large $t$) to late (small $t$) in the reverse flow. (b) Optimally calibrated values of the spatial locality scale $P$ of the (E)LS machine as a function of time $t$ (see App. \ref{sec:multiscale} for details of calibration). (c) A schematic view of the time-dependent LS and ELS machines in which the locality neighborhood shrinks as the reverse time flows from top to bottom.
  • Figure 5: Match between theory and experiment. (a,b) Each pair of images shows a striking match between the output of the boundary broken ELS machine (left image in each pair) and the output of a trained CNN diffusion model (right image in each pair) when both models are given the same initial noise input. We compare theory with two architectures (ResNet in (a), and UNet in (b)) on three datasets (MNIST, CIFAR10 and FashionMNIST from top to bottom). See App. \ref{appendix:samples}, Fig. \ref{fig:resnet-mnist-zeros} to Fig. \ref{fig:cifar10-circular-resnet} for many comparisons and Fig. \ref{fig:a1} and Table \ref{tab:correlation_results} for quantitative $r^2$ values indicating a close match between theory and experiment. (c) Trained CNN diffusion models (right) produce well-known spatial inconsistencies (e.g., three-legged pants (rows 1, 4), three-armed tops (rows 3, 6), bifurcated shoes (rows 2, 5)). Remarkably, the ELS theory (left) predicts this behavior and mechanistically explains it through excessive spatial locality at late times in the reverse flow.
  • ...and 19 more figures
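
The local consistency condition stated in the Figure 3 caption can be checked directly. The sketch below is an illustration only: it assumes black and white are encoded as -1 and +1 (an encoding chosen here, not taken from the paper), and the function name `is_locally_consistent` is hypothetical. It tests that every pixel is black or white and that the majority color of each $3\times3$ patch, taken with periodic boundary conditions, equals the color of its central pixel.

```python
import numpy as np

def is_locally_consistent(img):
    """Check the 3x3 majority rule with periodic boundaries.

    img: 2D array; black and white are assumed to be encoded as -1 and +1.
    """
    img = np.asarray(img)
    # Every pixel must be exactly black or white.
    if not np.all(np.isin(img, (-1, 1))):
        return False
    # Sum each pixel's 3x3 neighborhood; np.roll wraps around the edges,
    # which implements the periodic boundary condition.
    patch_sum = sum(
        np.roll(img, (dy, dx), axis=(0, 1))
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
    )
    # Nine values of +/-1 sum to an odd number, so the sign is never zero
    # and gives the majority color of the patch.
    return bool(np.all(np.sign(patch_sum) == img))

# A band of three white rows on a black background passes the check even
# though it is neither of the two training images (all black, all white).
stripes = -np.ones((6, 6), dtype=int)
stripes[:3, :] = 1
print(is_locally_consistent(stripes))
```

An isolated white pixel on a black background fails the check, because the patch centered on it is mostly black. Only sufficiently thick regions of each color survive, which is the sense in which the mosaic is locally consistent.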

Theorems & Definitions (8)

  • Definition 3.1
  • Definition 3.2
  • Theorem 4.1
  • Definition 2.1
  • Theorem 2.2
  • proof
  • Theorem 2.3
  • proof