Dynamical Regimes of Multimodal Diffusion Models

Emil Albrychiewicz, Andrés Franco Valiente, Li-Ching Chen

TL;DR

The paper presents a theoretical framework for multimodal diffusion using coupled Ornstein-Uhlenbeck processes, situating generation dynamics within nonequilibrium phase-transition theory. By diagonalizing the coupled system into common and difference modes, it reveals a spectral hierarchy of timescales that yields speciation and collapse times, and introduces the synchronization gap as a key phenomenon in cross-modal generation. Analytical results for symmetric and anisotropic coupling provide stability bounds and schedule-dependent tuning of mode emergence, while MNIST and exact-score experiments support the theory and demonstrate practical coupling strategies. The work suggests principled, time-dependent coupling schedules to target mode-specific timescales, offering a pathway beyond heuristic guidance tuning for multimodal diffusion models.

Abstract

Diffusion-based generative models have achieved unprecedented fidelity in synthesizing high-dimensional data, yet the theoretical mechanisms governing multimodal generation remain poorly understood. Here, we present a theoretical framework for coupled diffusion models, using coupled Ornstein-Uhlenbeck processes as a tractable model. Using the nonequilibrium statistical physics of dynamical phase transitions, we demonstrate that multimodal generation is governed by a spectral hierarchy of interaction timescales rather than simultaneous resolution. A key prediction is the "synchronization gap", a temporal window during the reverse generative process where distinct eigenmodes stabilize at different rates, providing a theoretical explanation for common desynchronization artifacts. We derive analytical conditions for speciation and collapse times under both symmetric and anisotropic coupling regimes, establishing strict bounds for coupling strength to avoid unstable symmetry breaking. We show that the coupling strength acts as a spectral filter that enforces a tunable temporal hierarchy on generation. We support these predictions through controlled experiments with diffusion models trained on MNIST datasets and exact score samplers. These results motivate time-dependent coupling schedules that target mode-specific timescales, offering a potential alternative to ad hoc guidance tuning.
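
The decoupling into common and difference modes mentioned above can be illustrated with a minimal sketch. It assumes a two-variable linear drift of the form dz/dt = -A z with A = [[a, -g], [-g, a]]. This matrix, the sign convention for g, and the names `drift`, `to_modes`, and `mode_rates` are illustrative assumptions, not the paper's actual equations. Under that assumption, rotating (x, y) into (x+y)/sqrt(2) and (x-y)/sqrt(2) yields two independent modes with relaxation rates a-g and a+g.

```python
import math


def drift(a: float, g: float, x: float, y: float) -> tuple[float, float]:
    """Deterministic drift -A z for the ASSUMED matrix A = [[a, -g], [-g, a]]."""
    return -(a * x - g * y), -(a * y - g * x)


def to_modes(x: float, y: float) -> tuple[float, float]:
    """Rotate (x, y) into common and difference coordinates."""
    s = math.sqrt(2.0)
    return (x + y) / s, (x - y) / s


def mode_rates(a: float, g: float) -> tuple[float, float]:
    """Eigenvalues of the assumed A: (common-mode rate, difference-mode rate)."""
    return a - g, a + g


# Hypothetical parameter values, chosen only for illustration.
a, g = 1.0, 0.3
x, y = 0.7, -0.2

dx, dy = drift(a, g, x, y)
c, d = to_modes(x, y)        # state in mode coordinates
dc, dd = to_modes(dx, dy)    # drift in mode coordinates
lam_c, lam_d = mode_rates(a, g)

# In mode coordinates each drift component depends only on its own mode.
print(dc, -lam_c * c)
print(dd, -lam_d * d)
```

In this toy form both modes are stable only when |g| < a, and the two rates separate as g grows, which is one simple way to picture a coupling-dependent hierarchy of timescales. The paper's own definitions of speciation time, collapse time, and the synchronization gap are not reproduced here.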

Paper Structure

This paper contains 12 sections, 129 equations, 14 figures, 1 table.

Figures (14)

  • Figure 1: Impact of coupling strength $g$ on speciation time $t_s$. Analytic solutions (solid and dashed) show that the common-mode-only case exhibits a non-monotonic relationship, peaking near $g=0.45$, while the difference mode leads to a monotonic decrease in speciation time. Numerical results for mixtures (dotted/dash-dotted) demonstrate how intermediate modal configurations interpolate between these extremes. Parameters are set to $\beta = 1$, $\sigma_W^2 = 2$, and $\sigma^2 = 1$.
  • Figure 2: Collapse time vs. coupling strength $g$. We plot analytic solutions for the collapse time of the common and difference modes, together with numerical solutions. For plotting and numerical evaluation, we set $\alpha=1$ and $\sigma^2/\sigma_W^2=1$.
  • Figure 3: Numerical solution for speciation time vs. coupling strength $g$ in the uninformative context regime ($\mu_x(0)=0$ and $m_y^2=2$). The monotonic decrease in $t_S$ indicates that stronger coupling delays class speciation toward the end of the reverse process (closer to $t=0$).
  • Figure 4: Numerical solution for speciation time vs. coupling constant $g$ and angle $\theta$ in the case of $m_x^2=m_y^2=1$. For plotting we set $\sigma_W^2=2$, $\sigma_x^2=\sigma_y^2=1$ and $\beta=1$. The white region marks where $\kappa(t)<1$ for all $t$. The blue dashed line indicates an approximate $g_{\text{crit}}(\theta)$.
  • Figure 5: Numerical solutions for joint collapse time $t_C$ and conditional collapse time $t_{C,y|x}$. For plotting we set $\sigma_W^2=1$, $\sigma_x^2=\sigma_y^2=1$, $\beta=1$ and $\alpha=1$.
  • ...and 9 more figures