Demystifying Diffusion Objectives: Reweighted Losses are Better Variational Bounds
Jiaxin Shi, Michalis K. Titsias
TL;DR
The paper reframes diffusion-model training by deriving a cascade of time-dependent variational lower bounds on the data log-likelihood, showing that reweighted losses can yield tighter bounds than the standard ELBO and reduce data-model KL divergences. It introduces optimal decoders to obtain improved ELBOs and proves that incorporating more optimal steps tightens the bound, at a cost in sampling tractability. Because common reweighted objectives are equivalent to weighted sums of these improved bounds, the framework gives a general theoretical justification for reweighted losses that covers both continuous Gaussian and masked diffusion models. The authors apply it to masked diffusion, deriving weighting schemes that respect the log-SNR parameterization and demonstrating substantial improvements in pixel-space ImageNet 64×64 generation (e.g., FID improvements up to 1.92 with 324M parameters).
Abstract
We derive a new theoretical interpretation of the reweighted losses that are widely used for training diffusion models. Our method is based on constructing a cascade of time-dependent variational lower bounds on the data log-likelihood that provably improves upon the standard evidence lower bound and results in reduced data-model KL divergences. Combining such bounds gives rise to reweighted objectives that can be applied to any generative diffusion model, including both continuous Gaussian diffusion and masked (discrete) diffusion models. We then showcase this framework in masked diffusion and report significant improvements over previous training losses in pixel-space image modeling, approaching sample quality comparable to continuous diffusion models. Our results also provide a theoretical justification for the simple weighting scheme widely used in masked image models.
