Dynamical Regimes of Multimodal Diffusion Models
Emil Albrychiewicz, Andrés Franco Valiente, Li-Ching Chen
TL;DR
The paper presents a theoretical framework for multimodal diffusion using coupled Ornstein-Uhlenbeck processes, situating generation dynamics within nonequilibrium phase-transition theory. By diagonalizing the coupled system into common and difference modes, it reveals a spectral-time hierarchy that yields speciation and collapse times, and introduces the synchronization gap as a key phenomenon in cross-modal generation. Analytical results for symmetric and anisotropic coupling provide stability bounds and schedule-dependent tuning of mode emergence, while MNIST and exact-score experiments validate the theory and demonstrate practical coupling strategies. The work suggests principled, time-dependent coupling schedules to target mode-specific timescales, offering a pathway beyond heuristic guidance tuning for robust multimodal diffusion models.
Abstract
Diffusion-based generative models have achieved unprecedented fidelity in synthesizing high-dimensional data, yet the theoretical mechanisms governing multimodal generation remain poorly understood. Here, we present a theoretical framework for coupled diffusion models, using coupled Ornstein-Uhlenbeck processes as a tractable model. Drawing on the nonequilibrium statistical physics of dynamical phase transitions, we demonstrate that multimodal generation is governed by a spectral hierarchy of interaction timescales rather than by simultaneous resolution of all modes. A key prediction is the "synchronization gap", a temporal window during the reverse generative process in which distinct eigenmodes stabilize at different rates, providing a theoretical explanation for common desynchronization artifacts. We derive analytical conditions for speciation and collapse times under both symmetric and anisotropic coupling regimes, establishing strict bounds on the coupling strength needed to avoid unstable symmetry breaking. We show that the coupling strength acts as a spectral filter that enforces a tunable temporal hierarchy on generation. We support these predictions through controlled experiments with diffusion models trained on MNIST datasets and with exact score samplers. These results motivate time-dependent coupling schedules that target mode-specific timescales, offering a potential alternative to ad hoc guidance tuning.
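
To make the idea of common and difference modes concrete, the following is a minimal sketch for two scalar modalities with symmetric coupling. The drift matrix A = [[1, -g], [-g, 1]], the coupling symbol g, and the helper names are illustrative assumptions, not the paper's definitions. Under this toy choice the two eigenmodes relax at different rates, which is the kind of timescale separation the abstract describes.

```python
import numpy as np

def coupled_ou_modes(g: float):
    """Eigen-decompose an ASSUMED symmetric drift matrix for two coupled
    scalar Ornstein-Uhlenbeck processes:
        dx = -A x dt + noise,   A = [[1, -g], [-g, 1]]
    (hypothetical toy model, not taken from the paper).
    """
    A = np.array([[1.0, -g], [-g, 1.0]])
    # eigh returns eigenvalues in ascending order for symmetric matrices
    rates, vecs = np.linalg.eigh(A)
    return rates, vecs

def is_stable(g: float) -> bool:
    """In this toy model both relaxation rates must be positive."""
    rates, _ = coupled_ou_modes(g)
    return bool(np.all(rates > 0))

g = 0.5  # illustrative coupling strength
rates, vecs = coupled_ou_modes(g)

# For g > 0 the slower mode is the common mode (x1 + x2, rate 1 - g)
# and the faster mode is the difference mode (x1 - x2, rate 1 + g).
timescales = 1.0 / rates
gap_ratio = timescales[0] / timescales[1]  # separation between the two modes

print("relaxation rates:", rates)
print("relaxation timescales:", timescales)
print("timescale ratio (common / difference):", gap_ratio)
print("stable at g=0.5:", is_stable(0.5), "| stable at g=1.2:", is_stable(1.2))
```

In this sketch, increasing g widens the ratio between the two timescales, and the system loses stability once g reaches 1. Both properties belong to the assumed toy matrix only; the paper's actual bounds and speciation or collapse times depend on its own definitions.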
