Understanding Representation Dynamics of Diffusion Models via Low-Dimensional Modeling
Xiao Li, Zekai Zhang, Xiang Li, Siyi Chen, Zhihui Zhu, Peng Wang, Qing Qu
TL;DR
Diffusion models exhibit unimodal representation dynamics: feature quality peaks at an intermediate noise level. The paper explains this phenomenon using a mixture of low-rank Gaussians (MoLRG) data model and a tractable denoiser parameterization. It develops a theoretical framework built on a signal-to-noise ratio (SNR) representation metric and proves that the optimal denoiser aligns with the ground-truth subspaces, producing SNR curves that are unimodal in diffusion time. Empirically, the presence of unimodal dynamics in classification tasks tracks model generalization, and the curve shifts as the model transitions to memorization under changes in dataset size, model capacity, and training duration. These results bridge distribution learning and representation learning in diffusion models, offering a principled basis for early stopping and representation-based evaluation.
Abstract
Diffusion models, though originally designed for generative tasks, have demonstrated impressive self-supervised representation learning capabilities. A particularly intriguing phenomenon in these models is the emergence of unimodal representation dynamics, where the quality of learned features peaks at an intermediate noise level. In this work, we conduct a comprehensive theoretical and empirical investigation of this phenomenon. Leveraging the inherent low-dimensional structure of image data, we theoretically demonstrate that unimodal dynamics emerge when the diffusion model successfully captures the underlying data distribution. The unimodality arises from an interplay between denoising strength and class confidence across noise scales. Empirically, we further show that, in classification tasks, the presence of unimodal dynamics reliably reflects the generalization of the diffusion model: it emerges when the model generates novel images and gradually transitions to a monotonically decreasing curve as the model begins to memorize the training data.
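
As an illustration of the class-confidence side of this interplay, the following is a minimal NumPy sketch rather than the paper's implementation. It assumes a noiseless mixture of low-rank Gaussians with mutually orthogonal subspaces, an additive noising $x_t = x_0 + \sigma \epsilon$, and uses the closed-form posterior mean of that toy model as the denoiser. All function names are hypothetical, and the energy ratio it reports is only an SNR-like proxy; the paper's exact metric is not reproduced here.

```python
import numpy as np


def make_subspaces(n, d, num_classes, rng):
    """Return num_classes mutually orthogonal d-dim bases in R^n (assumed setup)."""
    # QR of a Gaussian matrix gives orthonormal columns; split them per class.
    q, _ = np.linalg.qr(rng.standard_normal((n, d * num_classes)))
    return [q[:, k * d:(k + 1) * d] for k in range(num_classes)]


def sample_clean(bases, num_samples, rng):
    """Draw x0 = U_k a with k uniform over classes and a ~ N(0, I_d)."""
    d = bases[0].shape[1]
    labels = rng.integers(len(bases), size=num_samples)
    coeffs = rng.standard_normal((num_samples, d))
    x0 = np.stack([bases[k] @ a for k, a in zip(labels, coeffs)])
    return x0, labels


def posterior_mean(x_t, bases, sigma):
    """Closed-form E[x0 | x_t] for the toy model, plus class posteriors."""
    proj = np.stack([x_t @ U for U in bases])        # (K, N, d)
    energy = (proj ** 2).sum(axis=-1)                # (K, N)
    logits = energy / (2 * sigma**2 * (1 + sigma**2))
    logits -= logits.max(axis=0, keepdims=True)      # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=0, keepdims=True)    # softmax over classes
    # Each class contributes a shrunken projection U_k U_k^T x_t / (1 + sigma^2).
    mixed = sum(w[:, None] * (p @ U.T) for w, p, U in zip(weights, proj, bases))
    return mixed / (1 + sigma**2), weights.T         # (N, n), (N, K)


def confidence_and_energy_ratio(x0, labels, bases, sigma, rng):
    """Mean posterior weight on the true class, and an SNR-like energy ratio."""
    x_t = x0 + sigma * rng.standard_normal(x0.shape)
    x_hat, weights = posterior_mean(x_t, bases, sigma)
    idx = np.arange(len(labels))
    confidence = weights[idx, labels].mean()
    # Energy of the denoised output inside each class subspace: (N, K).
    energy = np.stack([((x_hat @ U) ** 2).sum(axis=-1) for U in bases], axis=1)
    signal = energy[idx, labels]
    other = energy.sum(axis=1) - signal
    return confidence, signal.mean() / other.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bases = make_subspaces(n=32, d=4, num_classes=3, rng=rng)
    x0, labels = sample_clean(bases, 2000, rng)
    for sigma in (0.2, 1.0, 5.0):
        conf, ratio = confidence_and_energy_ratio(x0, labels, bases, sigma, rng)
        print(f"sigma={sigma:4.1f}  class confidence={conf:.3f}  energy ratio={ratio:.3e}")
```

In this toy setting the posterior weight on the true class falls toward chance as $\sigma$ grows. Because the clean data here has no off-subspace component, the sketch is not expected to reproduce the unimodal curve by itself; it only isolates how class confidence degrades with noise.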
