
The Entropic Signature of Class Speciation in Diffusion Models

Florian Handke, Dejan Stančević, Felix Koulischer, Thomas Demeester, Luca Ambrogioni

TL;DR

The paper tackles how semantic structure emerges during diffusion-based generation by introducing the class-conditional entropy $H[Z|X_t]$ as an online diagnostic of semantic speciation. It provides a theoretical analysis in high-dimensional Gaussian mixtures, identifying a logarithmic speciation time scale for VP diffusion and showing that entropy production concentrates around this window, while VE/EDM lacks a sharp transition. The approach is validated on EDM2-XS and Stable Diffusion 1.5, demonstrating that partitioned entropy isolates noise regimes where specific semantic features arise and reveals how guidance redistributes semantic information over time. These results bridge information-theoretic and dynamical perspectives on diffusion and offer a practical framework for time-localized, feature-specific control of generative models.
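A back-of-the-envelope calculation shows where the logarithmic VP time scale comes from and why VE has no sharp transition. It is a sketch under assumptions that are not taken from the paper: equiprobable labels $z \in \{\pm 1\}$, components $x_0 \mid z \sim \mathcal{N}(z\mu, I_d)$ with $\|\mu\|^2 = d$, forward kernel $x_t = \alpha_t x_0 + \sigma_t \varepsilon$, and a VE schedule with $\sigma_t = t$. Conditioned on $z = +1$, the posterior logit is Gaussian and depends on $t$ only through an effective signal-to-noise ratio $s_t$:

```latex
\begin{aligned}
x_t \mid z &\sim \mathcal{N}\big(z\,\alpha_t\mu,\; v_t I_d\big), \qquad v_t = \alpha_t^2 + \sigma_t^2,\\
\log\frac{p(z=+1\mid x_t)}{p(z=-1\mid x_t)} &= \frac{2\alpha_t\,\mu^\top x_t}{v_t}
  \;\overset{d}{=}\; 2s_t + 2\sqrt{s_t}\,\xi, \qquad \xi\sim\mathcal{N}(0,1),\quad s_t = \frac{\alpha_t^2 d}{v_t},\\
H[Z\mid X_t] &= \mathbb{E}_{\xi}\Big[h_b\big(\operatorname{sigmoid}(2s_t + 2\sqrt{s_t}\,\xi)\big)\Big],
  \qquad h_b(p) = -p\log p-(1-p)\log(1-p),\\
\text{VP } (\alpha_t=e^{-t},\ v_t=1):&\quad s_t = d\,e^{-2t} = 1 \iff t = \tfrac{1}{2}\log d,\\
\text{VE } (\alpha_t=1,\ \sigma_t=t):&\quad s_t = \frac{d}{1+t^2} = 1 \iff t = \sqrt{d-1}\approx\sqrt{d}.
\end{aligned}
```

Under these assumptions, any fixed entropy band corresponds to a fixed interval $[s_-, s_+]$ of $s_t$. The VP window therefore has the $d$-independent width $\frac{1}{2}\log(s_+/s_-)$, while the VE window has width $\sqrt{d/s_- - 1} - \sqrt{d/s_+ - 1} = \Theta(\sqrt{d})$, which is consistent with the trend described for Figure 5.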

Abstract

Diffusion models do not recover semantic structure uniformly over time. Instead, samples transition from semantic ambiguity to class commitment within a narrow regime. Recent theoretical work attributes this transition to dynamical instabilities along class-separating directions, but practical methods to detect and exploit these windows in trained models are still limited. We show that tracking the class-conditional entropy of a latent semantic variable given the noisy state provides a reliable signature of these transition regimes. Restricting the entropy to semantic partitions further allows it to resolve semantic decisions at different levels of abstraction. We analyze this behavior in high-dimensional Gaussian mixture models and show that the entropy rate concentrates on the same logarithmic time scale as the speciation symmetry-breaking instability previously identified in variance-preserving diffusion. We validate our method on EDM2-XS and Stable Diffusion 1.5, where class-conditional entropy consistently isolates the noise regimes critical for semantic structure formation. Finally, we use our framework to quantify how guidance redistributes semantic information over time. Together, these results connect information-theoretic and statistical physics perspectives on diffusion and provide a principled basis for time-localized control.
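The class-conditional entropy $H[Z\mid X_t]$ is the average, over noisy samples $x_t$, of the entropy of the posterior $p(z\mid x_t)$; a semantic partition is handled by summing posterior mass within each block before taking the entropy. The sketch below assumes the posteriors are already available as an array (for example from a noise-aware classifier); how the paper obtains them is not described in this summary, and every function name is hypothetical. It also includes a finite-difference entropy production and a per-time percentile summary of how guidance shifts entropy production across classes, the quantities the figure captions refer to.

```python
import numpy as np

# Hypothetical helpers sketching the estimators; not the paper's implementation.

def entropy(p, axis=-1):
    """Shannon entropy in nats along `axis`, treating 0 * log 0 as 0."""
    p = np.asarray(p, dtype=float)
    log_p = np.log(np.where(p > 0, p, 1.0))
    return -np.sum(p * log_p, axis=axis)

def partition_probs(post, subset):
    """Collapse K-class posteriors of shape (N, K) into a binary partition:
    probability mass on `subset` vs. mass on its complement."""
    post = np.asarray(post, dtype=float)
    mask = np.zeros(post.shape[-1], dtype=bool)
    mask[list(subset)] = True
    p_in = post[..., mask].sum(axis=-1)
    return np.stack([p_in, 1.0 - p_in], axis=-1)

def conditional_entropy(post, subset=None):
    """Monte Carlo estimate of H[Z | X_t] = E_{x_t}[H(p(z | x_t))] from posteriors
    evaluated at N samples x_t ~ p_t. With `subset`, estimates the partitioned
    entropy of the binary variable 1{Z in subset}."""
    if subset is not None:
        post = partition_probs(post, subset)
    return float(np.mean(entropy(post)))

def entropy_production(ts, H):
    """Finite-difference dH/dt on a (possibly non-uniform) grid of forward times."""
    return np.gradient(np.asarray(H, dtype=float), np.asarray(ts, dtype=float))

def guidance_shift(hdot_guided, hdot_base):
    """25th, 50th and 75th percentiles across classes of the guided-minus-unguided
    entropy production; inputs have shape (num_classes, num_times)."""
    diff = np.asarray(hdot_guided, dtype=float) - np.asarray(hdot_base, dtype=float)
    return np.percentile(diff, [25, 50, 75], axis=0)

# Usage on made-up posteriors (4 samples, 3 classes), for illustration only
post = np.array([[0.90, 0.05, 0.05],
                 [0.10, 0.80, 0.10],
                 [1 / 3, 1 / 3, 1 / 3],
                 [0.50, 0.50, 0.00]])
print(conditional_entropy(post))              # full label space
print(conditional_entropy(post, subset=[0]))  # class 0 vs. the rest
```

Derivatives here are taken with respect to forward time $t$, so entropy production is positive where entropy rises with noise; flip the sign to read it along the denoising direction.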

Paper Structure

This paper contains 32 sections, 49 equations, 16 figures, 2 tables, and 1 algorithm.

Figures (16)

  • Figure 1: Overview of entropy production in generative diffusion. (a) Entropy production, quantified as the temporal derivative of the conditional entropy during the denoising process. The dashed curve shows the class-conditional entropy production over the full label space, capturing semantic commitment at the level of the complete class variable. The solid curve shows the partitioned class-conditional entropy production for the binary decision between class 0 and class 1, isolating when this specific semantic distinction is resolved. (b) Entropy production on a real dataset. Each curve shows the partitioned class-conditional entropy production for a single class against its complement. The lower panel displays representative noiseless predictions along the denoising trajectory for the highlighted classes, illustrating how semantic structure emerges around the corresponding entropy production peaks.
  • Figure 2: Class-conditional entropies $H[Z\mid X_t]$ for an equiprobable two-component Gaussian mixture, shown on different time scales for VP and VE kernels and several values of $d$. Panels (a) and (b) show the conditional entropy in natural time $t$ under a VP kernel ($\alpha_t=e^{-t}$, $\sigma_t^2=1-e^{-2t}$) and a VE kernel ($\alpha_t=1$, $\sigma_t^2=t^2$), respectively. Panel (c) shows the VP conditional entropy on a time axis rescaled by $t_s=\frac{1}{2}\log d$. The vertical dotted lines indicate the speciation time $t_s$, given by $\frac{1}{2}\log d$ for VP and $\sqrt{d}$ for VE, while the horizontal dotted line indicates convergence on the rescaled axis $u=t/t_s$.
  • Figure 3: Overview of information distortion caused by optimal guidance on ImageNet. (Top) Class-conditional entropy production profiles when guidance with scale $\omega$ is applied within the gray interval. From left to right, intervals and guidance scales are optimized with respect to FD$_{\text{DINOv2}}$ (limited interval), FID (limited interval), and FD$_{\text{DINOv2}}$ with guidance applied throughout the full denoising trajectory. (Bottom) Difference between the guided entropy production $\dot H_\omega$ and the unguided baseline $\dot H$ (Figure 1), summarized by the median and the 25th and 75th percentiles across classes. Green bars indicate an increase in entropy production relative to the baseline, while red bars indicate a reduction.
  • Figure 4: Entropy profiles for binary partitions of the form “A wooden chair” vs. “A wooden chair + attribute”. (Left) Profiles computed along the mixture distribution show that low-frequency changes (e.g., color) exhibit sharper entropy decay at higher noise levels. (Right) Samples with shared initial conditions confirm the earlier generation of low-frequency details (blue walls) relative to high-frequency elements (cat), with semantic commitment coinciding with entropy collapse.
  • Figure 5: Class-conditional entropy $H[Z\mid X_t]$ against time $t$ for equiprobable two-component Gaussian mixtures, for the EDM (left) and VP (right) forward SDEs and several values of $d$. Shaded vertical regions represent time intervals for which entropy lies between values of $0.4$ and $0.6$, while vertical dotted lines represent the speciation times for different $d$. In the EDM case the transition region broadens with $d$, whereas in the VP case it remains approximately constant in width.
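The VP/VE contrast in Figures 2 and 5 can be illustrated numerically under an assumed setup that is not taken from the paper: equiprobable components $\mathcal{N}(\pm\mu, I_d)$ with $\|\mu\|^2 = d$, and a VE schedule with $\sigma_t = t$. In that setup the posterior logit of the true class is $2s_t + 2\sqrt{s_t}\,\xi$ with $\xi \sim \mathcal{N}(0,1)$ and $s_t = \alpha_t^2 d / (\alpha_t^2 + \sigma_t^2)$, so $H[Z\mid X_t]$ depends on time only through $s_t$. The sketch below computes that entropy by Gauss-Hermite quadrature and locates the window in which it lies between 0.4 and 0.6 (taken in nats here, also an assumption). Function and constant names are hypothetical; it illustrates the qualitative trend rather than reproducing the paper's plots.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Probabilists' Gauss-Hermite rule: sum(w * f(x)) / sqrt(2*pi) ~= E[f(xi)], xi ~ N(0, 1)
_XI, _W = hermegauss(80)
_W = _W / np.sqrt(2.0 * np.pi)

def cond_entropy(snr):
    """H[Z | X_t] in nats for the assumed symmetric two-component mixture,
    as a function of the effective SNR alpha_t^2 d / (alpha_t^2 + sigma_t^2)."""
    a = 2.0 * snr + 2.0 * np.sqrt(snr) * _XI   # posterior logit of the true class
    log_p = -np.logaddexp(0.0, -a)              # log sigmoid(a), computed stably
    log_q = -np.logaddexp(0.0, a)               # log(1 - sigmoid(a))
    h = -(np.exp(log_p) * log_p + np.exp(log_q) * log_q)
    return float(np.sum(_W * h))

def snr_at_entropy(target, lo=1e-4, hi=1e4, iters=100):
    """Invert cond_entropy (decreasing in snr) by bisection on log(snr)."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if cond_entropy(mid) > target:
            lo = mid  # still too uncertain: the crossing needs a larger snr
        else:
            hi = mid
    return np.sqrt(lo * hi)

# SNR values bounding the entropy band; larger SNR means an earlier forward time
S_LOW = snr_at_entropy(0.6)
S_HIGH = snr_at_entropy(0.4)

def vp_window(d):
    # VP: alpha_t = e^{-t}, sigma_t^2 = 1 - e^{-2t}  =>  snr = d * e^{-2t}
    return 0.5 * np.log(d / S_HIGH), 0.5 * np.log(d / S_LOW)

def ve_window(d):
    # VE (assumed sigma_t = t), alpha_t = 1  =>  snr = d / (1 + t^2)
    return np.sqrt(d / S_HIGH - 1.0), np.sqrt(d / S_LOW - 1.0)

for d in (64, 256, 1024, 4096):
    a, b = vp_window(d)
    c, e = ve_window(d)
    print(f"d={d:5d} | VP t_s={0.5 * np.log(d):5.2f} width={b - a:.3f}"
          f" | VE t_s={np.sqrt(d):5.1f} width={e - c:.2f}")
```

Because $t$ enters the VP case only through $\frac{1}{2}\log(d/s)$, the VP window keeps the width $\frac{1}{2}\log(S_{\text{HIGH}}/S_{\text{LOW}})$ for every $d$ and only shifts with $t_s$, whereas the VE window width grows roughly like $\sqrt{d}$.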