Towards a mathematical theory for consistency training in diffusion models
Gen Li, Zhihan Huang, Yuting Wei
TL;DR
The paper addresses the theoretical gaps in consistency training for diffusion models by establishing a non-asymptotic guarantee that a sequence of learned consistency mappings $\{f_t\}$ can yield single-step sampling that closely matches the data distribution. By framing the training as iterative consistency learning and imposing Lipschitz and estimation-error assumptions, the authors derive a Wasserstein bound $W_1(f_T(X_T), X_1) \le C_1 \frac{L_f^3 d^{5/2} \log^{5} T}{T} + \varepsilon + \varepsilon_{\mathcal{F}}$, and show that $T = \tilde{O}\big( \frac{L_f^3 d^{5/2}}{\varepsilon+\varepsilon_{\mathcal{F}}} \big)$ steps suffice to achieve $W_1 \le 2(\varepsilon+\varepsilon_{\mathcal{F}})$. The framework decouples training and sampling, enabling efficient one-shot sampling while providing a quantitative benchmark for fidelity that depends explicitly on the data dimension $d$ and Lipschitz constant $L_f$. The results offer a principled justification for consistency models and guide practical design of training schedules and model capacity for reliable fast sampling.
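To make the step-count claim concrete, the sketch below evaluates the $T$-dependent term of the bound and searches for the smallest $T$ that pushes it below a target such as $\varepsilon+\varepsilon_{\mathcal{F}}$. It is only an illustration: the summary does not give the value of $C_1$, so the default `C1=1.0` is a placeholder, and the function names are hypothetical. Because $\log^5 T / T$ is decreasing only for $T > e^5 \approx 148.4$, the search is restricted to $T \ge 149$.

```python
import math


def discretization_term(T, L_f, d, C1=1.0):
    """Evaluate C1 * L_f^3 * d^(5/2) * log(T)^5 / T.

    C1 is a placeholder: the source does not state its value.
    """
    return C1 * L_f**3 * d**2.5 * math.log(T) ** 5 / T


def steps_needed(target, L_f, d, C1=1.0):
    """Smallest integer T >= 149 with discretization_term(T) <= target.

    log(T)^5 / T is decreasing only for T > e^5 (about 148.4), so the
    search is restricted to that range, where bisection is valid.
    """
    if target <= 0:
        raise ValueError("target must be positive")
    lo = 149
    if discretization_term(lo, L_f, d, C1) <= target:
        return lo
    # Double the upper end until the term drops below the target.
    hi = 2 * lo
    while discretization_term(hi, L_f, d, C1) > target:
        lo, hi = hi, 2 * hi
    # Invariant: term(lo) > target >= term(hi).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if discretization_term(mid, L_f, d, C1) <= target:
            hi = mid
        else:
            lo = mid
    return hi
```

For example, `steps_needed(0.1, L_f=1.0, d=4)` returns the first $T$ at which the discretization term falls below $0.1$ under the placeholder constant. With that choice of $T$, the full bound is at most $2(\varepsilon+\varepsilon_{\mathcal{F}})$ when the target is set to $\varepsilon+\varepsilon_{\mathcal{F}}$.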
Abstract
Consistency models, which were proposed to mitigate the high computational overhead during the sampling phase of diffusion models, facilitate single-step sampling while attaining state-of-the-art empirical performance. When integrated into the training phase, consistency models attempt to train a sequence of consistency functions capable of mapping any point at any time step of the diffusion process to its starting point. Despite the empirical success, a comprehensive theoretical understanding of consistency training remains elusive. This paper takes a first step towards establishing theoretical underpinnings for consistency models. We demonstrate that, in order to generate samples within $\varepsilon$ proximity to the target in distribution (measured by some Wasserstein metric), it suffices for the number of steps in consistency learning to exceed the order of $d^{5/2}/\varepsilon$, with $d$ the data dimension. Our theory offers rigorous insights into the validity and efficacy of consistency models, illuminating their utility in downstream inference tasks.
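The defining property of a consistency function, that every point on a trajectory maps back to the same starting point, can be checked on a toy case with a closed form. The sketch below is not the paper's setting: it assumes a one-dimensional Gaussian data distribution $N(0, S^2)$ and the continuous-time variance-exploding process $x_t = x_0 + t z$ from the original consistency-model formulation, for which the probability flow ODE is $dx/dt = t x / (S^2 + t^2)$. The constants `S` and `EPS` and all function names are illustrative assumptions.

```python
import math

S = 1.0     # std of the toy Gaussian data distribution (assumed)
EPS = 1e-3  # smallest time on the trajectory (assumed)


def ode_velocity(x, t):
    """dx/dt of the probability flow ODE for x_t = x_0 + t*z, x_0 ~ N(0, S^2)."""
    return t * x / (S**2 + t**2)


def consistency_fn(x, t):
    """Closed-form map from a point at time t back to the trajectory start at EPS."""
    return x * math.sqrt((S**2 + EPS**2) / (S**2 + t**2))


def integrate_forward(x_start, t_end, n_steps=20000):
    """Follow the ODE from time EPS to t_end with the midpoint (RK2) rule."""
    x, t = x_start, EPS
    h = (t_end - EPS) / n_steps
    for _ in range(n_steps):
        x_mid = x + 0.5 * h * ode_velocity(x, t)
        x += h * ode_velocity(x_mid, t + 0.5 * h)
        t += h
    return x
```

Integrating a start point forward to several times and applying `consistency_fn` at each of them should return the same start point up to numerical error. Consistency training aims to learn such a map when no closed form is available.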
