A Phase Transition in Diffusion Models Reveals the Hierarchical Nature of Data

Antonio Sclocchi, Alessandro Favero, Matthieu Wyart

TL;DR

This work reveals a phase transition in the backward diffusion dynamics that separates high-level class reconstruction from smooth, multi-scale changes in low-level features, elucidating the hierarchical, compositional structure of real data. By modeling data with a tree-like Random Hierarchy Model and solving Bayes-optimal denoising with Belief Propagation, the authors predict a sharp transition in class reconstruction while lower-level features evolve gradually. Mean-field theory corroborates these predictions and aligns with empirical findings on ImageNet diffusion models and CNN activations. The results position diffusion models as powerful probes of data structure, offering a principled lens to study hierarchical generative processes and their implications for learning and generalization.
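
The tree-like generative process mentioned above can be sketched in a few lines. This is a minimal illustration, not the authors' code. It assumes the usual Random Hierarchy Model conventions: `v` is the vocabulary size at every level, `m` is the number of production rules per symbol, `s` is the branching factor, and no `s`-tuple is shared between two symbols. The function names `sample_rules` and `sample_rhm` are hypothetical.

```python
import numpy as np

def sample_rules(v, m, s, rng):
    """Give each of the v symbols m distinct s-tuples of lower-level symbols.

    Tuples are drawn without replacement, so no tuple belongs to two symbols
    (assumed convention). Returns an integer array of shape (v, m, s).
    """
    if m * v > v ** s:
        raise ValueError("need m * v <= v ** s to keep the rules unambiguous")
    codes = rng.choice(v ** s, size=m * v, replace=False)
    # Decode each integer code into its s base-v digits, i.e. one tuple of symbols.
    tuples = np.stack([(codes // v ** i) % v for i in range(s)], axis=1)
    return tuples.reshape(v, m, s)

def sample_rhm(rules_per_level, label, rng):
    """Expand a class label into a leaf sequence, one level of rules at a time."""
    seq = np.array([label])
    for rules in rules_per_level:  # ordered from the root down to the leaves
        picks = rng.integers(rules.shape[1], size=seq.size)  # one rule per symbol
        seq = rules[seq, picks].reshape(-1)  # each symbol becomes s children
    return seq

rng = np.random.default_rng(0)
v, m, s, L = 8, 2, 2, 3
rules = [sample_rules(v, m, s, rng) for _ in range(L)]
x = sample_rhm(rules, label=3, rng=rng)  # leaf sequence of length s ** L
```

Each level multiplies the sequence length by `s`, so a tree of depth `L` produces `s ** L` leaves, all generated from a single class label at the root.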

Abstract

Understanding the structure of real data is paramount in advancing modern deep-learning methodologies. Natural data such as images are believed to be composed of features organized in a hierarchical and combinatorial manner, which neural networks capture during learning. Recent advancements show that diffusion models can generate high-quality images, hinting at their ability to capture this underlying compositional structure. We study this phenomenon in a hierarchical generative model of data. We find that the backward diffusion process acting after a time $t$ is governed by a phase transition at some threshold time, where the probability of reconstructing high-level features, like the class of an image, suddenly drops. Instead, the reconstruction of low-level features, such as specific details of an image, evolves smoothly across the whole diffusion process. This result implies that at times beyond the transition, the class has changed, but the generated sample may still be composed of low-level elements of the initial image. We validate these theoretical insights through numerical experiments on class-unconditional ImageNet diffusion models. Our analysis characterizes the relationship between time and scale in diffusion models and puts forward generative models as powerful tools to model combinatorial data properties.
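
The forward-backward protocol described above (noise a sample up to time $t$, then denoise it and check whether the class survived) can be illustrated on a toy problem. The sketch below is not the paper's setup. It assumes a flat set of prototype vectors with a uniform prior instead of hierarchical data, a linear noise schedule with illustrative values, and it replaces the trained denoiser with the exact posterior over prototypes. All function names are hypothetical.

```python
import numpy as np

def alpha_bar(T=1000, beta_min=1e-4, beta_max=0.02):
    """Cumulative signal fraction of a linear beta schedule (assumed values)."""
    betas = np.linspace(beta_min, beta_max, T)
    return np.cumprod(1.0 - betas)

def forward_backward(prototypes, k, t, abar, rng):
    """Noise prototype k up to step t, then resample a prototype index."""
    x0 = prototypes[k]
    a = abar[t - 1]
    # Forward process: x_t = sqrt(a) * x_0 + sqrt(1 - a) * Gaussian noise.
    xt = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * rng.standard_normal(x0.shape)
    # Exact posterior over prototypes under a uniform prior (toy stand-in
    # for sampling from a trained model).
    sq_dist = np.sum((xt - np.sqrt(a) * prototypes) ** 2, axis=1)
    logp = -sq_dist / (2.0 * (1.0 - a))
    p = np.exp(logp - logp.max())
    p /= p.sum()
    return int(rng.choice(len(prototypes), p=p))

def retention_probability(prototypes, t, abar, n_trials=200, seed=0):
    """Fraction of trials in which the resampled index equals the initial one."""
    rng = np.random.default_rng(seed)
    n_classes = len(prototypes)
    hits = 0
    for i in range(n_trials):
        k = i % n_classes
        hits += int(forward_backward(prototypes, k, t, abar, rng) == k)
    return hits / n_trials

abar = alpha_bar()
prototypes = np.random.default_rng(0).standard_normal((10, 16))
curve = [retention_probability(prototypes, t, abar) for t in (1, 250, 500, 750, 1000)]
```

Sweeping `t` from 1 to `T` gives a retention curve for this toy data. It only illustrates the protocol; the hierarchical behavior studied in the paper requires hierarchical data.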

Paper Structure (36 sections, 55 equations, 16 figures)


Figures (16)

  • Figure 1: Illustration of forward-backward experiments. Images generated by a denoising diffusion probabilistic model starting from the top-left image and inverting the dynamics at different times $t$. $T$ corresponds to the time scale when the forward diffusion process converges to an isotropic Gaussian distribution. At small $t$, the class of the generated image remains unchanged, with only alterations of low-level features, such as the eyes of the leopard. After a characteristic time $t$, the class undergoes a phase transition and changes. However, some low-level attributes of the original image are retained to compose the new image. For instance, the wolf is composed of eyes, nose, and ears similar to those of the leopard, and the butterfly inherits its colors and black spots.
  • Figure 2: Left panel. Examples of images generated by reverting the diffusion process at different times $t$. Starting from the left images $x_0$ at time $t=0$, we generate samples $\hat{x}_{0}(t)\sim p_\theta(\hat{x}_{0}|x_t)$ by first running the diffusion process up to time $t$ and then reverting it, as described in \ref{sec:forward-backward-exp}. At time $t=T$, $x_T$ corresponds to isotropic Gaussian noise and the generated image $\hat{x}_{0}(T)$ is uncorrelated with $x_0$. At intermediate times, instead, a sudden change of the image class is observed, while some lower-level features are retained. Right panel. Cosine similarity between the post-activations of the hidden layers of a ConvNeXt Base (Liu et al., 2022) for the initial images $x_0$ and the synthesized ones $\hat{x}_{0}(t)$. Around $t \approx T/2$, the similarity between logits exhibits a sharp drop, indicating the change in class, while the hidden representations of the first layers change more smoothly. This indicates that certain low-level features from the original images are retained for composing the sampled images even after the class transition. To compute the cosine similarity, all activations are standardized, i.e., centered around the mean and scaled by the standard deviation computed on the $50000$ images of the ImageNet-1k validation set. At each time, the values of the cosine similarity correspond to the maximum of their empirical distribution over $10000$ images ($10$ per class of ImageNet-1k).
  • Figure 3: Sketch of the hierarchical and compositional structure of data. Left panel: The leopard in the image can be iteratively decomposed into features at different levels of abstraction. Right panel: Generative hierarchical model we study in this paper. In this example, depth $L=3$ and branching factor $s=2$. Different values of the input and latent variables are represented with different colors.
  • Figure 4: Illustration of the flow of messages in the Belief Propagation algorithm for the case $s=2$, $L=2$ of the Random Hierarchy Model. The factor nodes (squares) represent the rules that connect the variables at different levels of the hierarchy. The downward process is represented only for the leftmost branch.
  • Figure 5: Probability that the latent has not changed in the denoising process, corresponding to the largest marginal probability computed by BP, averaged for each layer, for varying inversion times of the diffusion process $t$. Data for the RHM with $v=32$, $m=8$, $s=2$, $L=10$. Each level of the tree, indicated in the legend, is represented with a different color. We observe the same behavior of the curves for ImageNet data in \ref{fig:imagenet-main}: the probability of the correct class has a sharp transition at a characteristic time scale, while the probabilities corresponding to latent variables in the lower levels change smoothly.
  • ...and 11 more figures
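
The similarity measure described in the Figure 2 caption (standardize the activations of one layer with reference statistics, then take the cosine similarity between original and generated images) can be sketched as follows. This is a minimal illustration under assumptions: the activations are assumed to be already extracted and flattened to shape `(n_images, n_features)`, the reference mean and standard deviation are assumed to be precomputed per feature, and the function name and the `eps` safeguard are hypothetical. Aggregation over images is left to the caller.

```python
import numpy as np

def standardized_cosine_similarity(acts_ref, acts_gen, mean, std, eps=1e-8):
    """Per-image cosine similarity between standardized activations.

    acts_ref, acts_gen: arrays of shape (n_images, n_features) for one layer.
    mean, std: per-feature statistics of shape (n_features,) from a reference set.
    eps: small constant (assumption) that avoids division by zero.
    """
    a = (acts_ref - mean) / (std + eps)
    b = (acts_gen - mean) / (std + eps)
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return num / den

# Usage with random stand-ins for real network activations.
rng = np.random.default_rng(0)
reference = rng.standard_normal((500, 64)) * 3.0 + 1.0
mean, std = reference.mean(axis=0), reference.std(axis=0)
acts_x0 = reference[:10]
acts_xhat = acts_x0 + 0.5 * rng.standard_normal(acts_x0.shape)
sims = standardized_cosine_similarity(acts_x0, acts_xhat, mean, std)
```

Standardizing first removes the shared per-feature offset, so the similarity reflects how the two images deviate from the dataset average and not the average itself.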