Diffusion Models are Kelly Gamblers
Akhil Premkumar
TL;DR
This work reframes diffusion models through an information-theoretic lens, likening the way they concentrate probability mass to Kelly gambling. It identifies two information budgets: neural entropy, which captures structure within X, and the mutual information I(X;Y) linking X to a conditioning signal Y. Entropy matching is introduced as a principled way to quantify the information needed to reverse diffusion and recover the data distribution, separating correlations within X, measured by the total correlation TC(X), from X–Y dependencies, measured by I(X;Y). Experiments on joint Gaussians and images show that neural entropy dominates the information budget for images, while I(X;Y) is a smaller, additional contribution; classifier-free guidance (CFG) can boost the mutual information, but the gain eventually saturates and may distort the learned distribution. The diffusion-autoencoder framework further reveals a hierarchy in which perceptual details and semantic structure are captured at different diffusion-time scales, which clarifies why conditioning signals often have limited influence on raw image generation. Overall, the paper connects diffusion, manifold structure, and optimal betting into a single picture of how information is stored, used, and limited in modern generative models, with practical implications for conditional generation and guidance strategies.
Abstract
We draw a connection between diffusion models and the Kelly criterion for maximizing returns in betting games. A signal that is correlated with the outcome of such a game can be used to focus the bets on a narrow range of high probability predictions. Diffusion models share the same paradigm in that they gradually concentrate the probability mass to fit the training data. We show that the information stored in an unconditional diffusion model captures, in part, the joint correlation between the components of the data variable $X$. Conditional diffusion models store additional information to bind the signal $X$ with the conditioning information $Y$, equal to the mutual information between them. The latter is only a small fraction of the total information in the neural network if the data is low-dimensional. We examine why this does not hinder conditional generation.
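To make the betting picture concrete, the following is a minimal sketch of the classical Kelly setting: a gambler who stakes wealth in proportion to the outcome probabilities, with and without access to a correlated signal. It is a toy illustration and not taken from the paper's experiments. The joint distribution `joint`, the payouts `odds`, and all function and variable names are hypothetical choices made for this example. Under these assumptions, the extra expected log-growth of wealth earned by betting on $p(x \mid y)$ instead of $p(x)$ should match the mutual information $I(X;Y)$, independent of the odds.

```python
import math

# Hypothetical joint distribution p(x, y): rows are 3 game outcomes x,
# columns are 2 values of the side signal y. Values are illustrative only.
joint = [
    [0.30, 0.05],
    [0.10, 0.15],
    [0.05, 0.35],
]
odds = [3.0, 3.0, 3.0]  # assumed payout per unit staked on each outcome

n_x, n_y = len(joint), len(joint[0])
p_x = [sum(joint[i]) for i in range(n_x)]
p_y = [sum(joint[i][j] for i in range(n_x)) for j in range(n_y)]


def growth_rate(bets):
    """Expected log2 growth of wealth per round.

    bets[j][i] is the fraction of wealth staked on outcome i when y = j.
    """
    return sum(
        joint[i][j] * math.log2(bets[j][i] * odds[i])
        for i in range(n_x)
        for j in range(n_y)
    )


# Kelly (proportional) betting without the signal: stake p(x) regardless of y.
blind_bets = [p_x for _ in range(n_y)]
# Kelly betting with the signal: stake the conditional p(x | y).
informed_bets = [[joint[i][j] / p_y[j] for i in range(n_x)] for j in range(n_y)]

gain = growth_rate(informed_bets) - growth_rate(blind_bets)

# Mutual information I(X;Y) in bits, computed directly from the joint.
mi = sum(
    joint[i][j] * math.log2(joint[i][j] / (p_x[i] * p_y[j]))
    for i in range(n_x)
    for j in range(n_y)
)

print(f"growth-rate gain = {gain:.6f} bits, I(X;Y) = {mi:.6f} bits")
```

The informed gambler concentrates stakes on the outcomes that the signal makes likely, which is the sense in which a correlated signal "focuses the bets"; the payouts cancel in the difference of growth rates, so only the statistical dependence between $X$ and $Y$ matters.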
