Diffusion Models are Kelly Gamblers

Akhil Premkumar

TL;DR

This work reframes diffusion models through an information-theoretic lens, likening their probability-shaping process to Kelly gambling and identifying two main information budgets: neural entropy (within X) and mutual information linking X to conditioning signals Y. It introduces entropy-matching as a principled way to quantify the information needed to reverse diffusion and recover data distributions, distinguishing intra-X correlations (TC(X)) from X–Y dependencies (I(X;Y)). Through joint Gaussian and image experiments, it shows that neural entropy dominates the information budget in images, while I(X;Y) remains a smaller, augmenting quantity; CFG can boost MI but eventually saturates and may distort the learned distribution. The diffusion-autoencoder framework further reveals a hierarchy where perceptual details and semantic structure are captured at different diffusion-time scales, clarifying why conditioning signals often have limited influence on raw image generation. Overall, the paper connects diffusion, manifold structure, and optimal betting into a cohesive picture of how information is stored, used, and limited in modern generative models, with practical implications for improving conditional generation and guidance strategies.

Abstract

We draw a connection between diffusion models and the Kelly criterion for maximizing returns in betting games. A signal that is correlated with the outcome of such a game can be used to focus the bets on a narrow range of high probability predictions. Diffusion models share the same paradigm in that they gradually concentrate the probability mass to fit the training data. We show that the information stored in an unconditional diffusion model captures, in part, the joint correlation between the components of the data variable $X$. Conditional diffusion models store additional information to bind the signal $X$ with the conditioning information $Y$, equal to the mutual information between them. The latter is only a small fraction of the total information in the neural network if the data is low-dimensional. We examine why this does not hinder conditional generation.
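
The link between a correlated signal and betting returns can be made concrete in the classic horse-race setting. The sketch below is not taken from the paper: the joint distribution, the 3-for-1 odds, and all function names are hypothetical choices for illustration. It bets proportionally (the Kelly strategy) with and without access to the signal $Y$, and checks that the gain in expected log-growth equals the mutual information $I(X;Y)$.

```python
import math

# Hypothetical joint distribution p(x, y): x is the winning horse (0, 1, 2),
# y is a binary side signal correlated with the outcome.
joint = {
    (0, 0): 0.30, (1, 0): 0.15, (2, 0): 0.05,
    (0, 1): 0.05, (1, 1): 0.15, (2, 1): 0.30,
}
# Assumed payout: 3-for-1 on every horse.
odds = {0: 3.0, 1: 3.0, 2: 3.0}


def marginal(joint, axis):
    """Sum the joint over the other variable (axis 0 -> p(x), axis 1 -> p(y))."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out


def growth_rate_without_signal(joint, odds):
    """Expected log2 wealth growth when betting the fraction b(x) = p(x)."""
    p_x = marginal(joint, 0)
    return sum(p * math.log2(p * odds[x]) for x, p in p_x.items())


def growth_rate_with_signal(joint, odds):
    """Expected log2 wealth growth when betting b(x|y) = p(x|y) after seeing y."""
    p_y = marginal(joint, 1)
    return sum(
        p * math.log2((p / p_y[y]) * odds[x]) for (x, y), p in joint.items()
    )


def mutual_information(joint):
    """I(X;Y) in bits, computed directly from the joint distribution."""
    p_x, p_y = marginal(joint, 0), marginal(joint, 1)
    return sum(
        p * math.log2(p / (p_x[x] * p_y[y]))
        for (x, y), p in joint.items()
        if p > 0
    )


w_plain = growth_rate_without_signal(joint, odds)
w_signal = growth_rate_with_signal(joint, odds)
mi = mutual_information(joint)
print(f"growth without signal: {w_plain:.4f} bits per race")
print(f"growth with signal:    {w_signal:.4f} bits per race")
print(f"gain: {w_signal - w_plain:.4f}, I(X;Y): {mi:.4f}")
```

The odds cancel in the difference of the two growth rates, so the identity holds for any payout schedule; only the absolute growth rates depend on the odds chosen here.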

Paper Structure

This paper contains 35 sections, 70 equations, 19 figures.

Figures (19)

  • Figure 1:
  • Figure 2: Left: Samples generated by a CFG-style modification to the conditional score $\nabla \log {p}(x_{t}, {t}|y)$ of a joint Gaussian, ${\bm{Y}} = A {\bm{X}} + \bm{{\varepsilon}}$ (cf. \ref{['eq:CFG', 'eq:LinearGaussianHigher']}). CFG strengthens the correlation between ${\bm{X}}$ and ${\bm{Y}}$, increasing their mutual information. But it also alters the relationship between them. Right: Mutual information under CFG for the joint Gaussian. We fix ${D_{{\bm{X}}}}=25$ and repeat the experiment with ${D_{{\bm{Y}}}}=5,10,15$. Notice how $I({\bm{X}}; {\bm{Y}})_{\rm CFG}$ increases as the guidance strength is ramped up (dashed lines are the original $I({\bm{X}}; {\bm{Y}})$ values, cf. \ref{['eq:LinearModelHigherMI']}). It saturates faster for smaller ${D_{{\bm{Y}}}}$, when ${\bm{Y}}$ has fewer degrees of freedom to encode the diversity in ${\bm{X}}$. See \ref{['fig:CFGonJointGaussian', 'fig:MIinCFG']} for more details.
  • Figure 3: Two ways of having a large $S^{{\bm{X}}|{\bm{Y}}}_{\rm tot}$. (Left) Flattening: $I({\bm{X}};{\bm{Y}})$ stays finite while $S^{{\bm{X}}}_{\rm tot}$ blows up due to lower intrinsic dimensionality of ${\bm{X}}$. (Right) Determinism: $I({\bm{X}};{\bm{Y}})$ diverges when ${\bm{Y}}$ is strongly correlated with ${\bm{X}}$, but $S^{{\bm{X}}}_{\rm tot}$ is under control since its covariance is full-rank. The entropy rate curves are the time derivatives of the corresponding entropy or mutual information. Similar plots for different degrees of flattening and correlation are given in \ref{['fig:MIcurves_LR', 'fig:MIcurves']}.
  • Figure 4: Probing the information stored in a diffusion model using a diffusion autoencoder. (Left column) t-SNE plots of ${\bm{z}}_{\rm sem}$ and ${\bm{z}}_{\rm per}$ for MNIST with $\tau=0.1{T}$. The former shows discernible clusters corresponding to the different digits, while the latter has no such structure. This indicates that the information collected from ${s} \in (0, \tau)$ contains little to no information about the digits. (Right) The correlation between the learned latents and the true labels, as quantified by $I({\bm{Z}}_\bullet;{\bm{Y}})$.
  • Figure 5: The linear Gaussian model from \ref{['eq:LinearGaussian']} with (a) higher noise/larger $\sigma_{\varepsilon}$, and (b) lower noise/smaller $\sigma_{\varepsilon}$. The blue curves are the conditionals $X|Y=y$ for some $y$, and the orange curve is the marginal over $X$. Notice how the conditionals have a tighter variance compared to the marginal. The contours are surfaces of constant probability in the joint distribution, and the red markers are some samples.
  • ...and 14 more figures
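
The behaviour described in the Figure 5 caption (conditionals tighter than the marginal, more so at lower noise) can be checked with a few lines of arithmetic. The sketch below is an assumption-laden illustration, not the paper's model: the referenced equation is not reproduced here, so it assumes a scalar $Y = aX + \varepsilon$ with $X \sim \mathcal{N}(0, \sigma_x^2)$ and $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)$ independent of $X$. The function names and parameter values are hypothetical.

```python
import math


def conditional_variance(a, sigma_x, sigma_eps):
    """Var(X | Y = y) for the assumed scalar model; it does not depend on y."""
    return (sigma_x**2 * sigma_eps**2) / (a**2 * sigma_x**2 + sigma_eps**2)


def gaussian_mi(a, sigma_x, sigma_eps):
    """I(X;Y) in nats for the assumed scalar linear Gaussian model."""
    signal_to_noise = (a * sigma_x) ** 2 / sigma_eps**2
    return 0.5 * math.log1p(signal_to_noise)


a, sigma_x = 1.0, 1.0  # hypothetical slope and prior width
results = {}
for label, sigma_eps in [("higher noise", 1.0), ("lower noise", 0.2)]:
    var_cond = conditional_variance(a, sigma_x, sigma_eps)
    mi = gaussian_mi(a, sigma_x, sigma_eps)
    results[label] = (var_cond, mi)
    print(f"{label}: Var(X|Y) = {var_cond:.4f}, "
          f"Var(X) = {sigma_x**2:.4f}, I(X;Y) = {mi:.4f} nats")
```

Under these assumptions the mutual information is half the log of the ratio of marginal to conditional variance, so a tighter conditional corresponds directly to a larger $I(X;Y)$, and the value grows without bound as $\sigma_\varepsilon \to 0$.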