Phased Consistency Models

Fu-Yun Wang, Zhaoyang Huang, Alexander William Bergman, Dazhong Shen, Peng Gao, Michael Lingelbach, Keqiang Sun, Weikang Bian, Guanglu Song, Yu Liu, Xiaogang Wang, Hongsheng Li

TL;DR

The paper tackles the limitations of latent consistency models in high-resolution text-conditioned generation by introducing Phased Consistency Models (PCMs) that partition the diffusion trajectory into sub-trajectories, enabling deterministic, multi-step sampling with improved controllability and efficiency. PCMs are paired with a novel guided distillation approach and an adversarial consistency loss to boost few-step generation quality, achieving state-of-the-art results on image and video benchmarks while maintaining compatibility with CFG settings. The method demonstrates superior or competitive performance against existing CM-based approaches across 1–16 steps and extends effectively to text-to-video generation, all backed by open-source code. Overall, PCM broadens the design space for fast, high-fidelity diffusion-based generation and offers practical benefits for multi-modal synthesis with reduced compute.

Abstract

Consistency Models (CMs) have made significant progress in accelerating the generation of diffusion models. However, their application to high-resolution, text-conditioned image generation in the latent space remains unsatisfactory. In this paper, we identify three key flaws in the current design of Latent Consistency Models (LCMs). We investigate the reasons behind these limitations and propose Phased Consistency Models (PCMs), which generalize the design space and address the identified limitations. Our evaluations demonstrate that PCMs outperform LCMs across 1–16 step generation settings. Although PCMs are designed for multi-step refinement, their 1-step generation results are comparable to those of previous state-of-the-art methods designed specifically for 1-step generation. Furthermore, we show that the methodology of PCMs is versatile and applicable to video generation, enabling us to train the state-of-the-art few-step text-to-video generator. Our code is available at https://github.com/G-U-N/Phased-Consistency-Model.

Paper Structure

This paper contains 39 sections, 6 theorems, 48 equations, 35 figures, 8 tables, 2 algorithms.

Key Result

Theorem 1

For an arbitrary sub-trajectory $\left[s_m, s_{m+1}\right]$, let $\Delta t_{m} := \max_{t_{n}, t_{n+1} \in [s_m, s_{m+1}]}\{|t_{n+1}-t_{n}|\}$, and let $\boldsymbol f^m(\cdot,\cdot ;\boldsymbol \phi)$ be the target phased consistency function induced by the pre-trained diffusion model (empirical PF-ODE). A
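
To make the quantity $\Delta t_m$ concrete, the following is a minimal sketch that splits a discrete timestep range into equal-width sub-trajectories and computes the largest adjacent step inside each one. The function names, the equal-width split, the 1000-step range, and the 20-step solver grid are all illustrative assumptions and are not taken from the paper or its released code.

```python
from typing import List


def phase_edges(num_timesteps: int, num_phases: int) -> List[int]:
    """Return the M + 1 edge timesteps s_0 < s_1 < ... < s_M.

    Assumption: phases have equal width; other splits are possible.
    """
    if num_phases < 1 or num_timesteps % num_phases != 0:
        raise ValueError("num_timesteps must be a positive multiple of num_phases")
    width = num_timesteps // num_phases
    return [m * width for m in range(num_phases + 1)]


def max_step_per_phase(timesteps: List[int], edges: List[int]) -> List[int]:
    """For each sub-trajectory [s_m, s_{m+1}], return Delta t_m: the largest
    gap |t_{n+1} - t_n| between adjacent grid points that both lie inside it."""
    deltas = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        inside = sorted(t for t in timesteps if lo <= t <= hi)
        gaps = [b - a for a, b in zip(inside[:-1], inside[1:])]
        deltas.append(max(gaps) if gaps else 0)
    return deltas


# Hypothetical setup: 1000 timesteps, 4 phases, solver grid with spacing 20.
edges = phase_edges(num_timesteps=1000, num_phases=4)
grid = list(range(0, 1001, 20))
deltas = max_step_per_phase(grid, edges)
```

With a uniform grid, every sub-trajectory has the same $\Delta t_m$; a non-uniform grid gives a different value per sub-trajectory, which is why the definition takes a maximum within each $[s_m, s_{m+1}]$.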

Figures (35)

  • Figure 1: PCMs: Towards stable and fast image and video generation.
  • Figure 2: Summative motivation. We observe and summarize three crucial limitations of (latent) consistency models, and generalize the design space to address all of them.
  • Figure 3: (Left) Illustrative comparison of diffusion models (DDPM), consistency models (CM), consistency trajectory models (CTM), and our phased consistency model. (Right) Simplified visualization of the forward SDE and reverse-time PF-ODE trajectories.
  • Figure 4: Training paradigm of PCMs. '?' means optional usage.
  • Figure 5: Qualitative Comparison. Our method achieves top-tier performance.
  • ...and 30 more figures

Theorems & Definitions (12)

  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof
  • Theorem 4
  • proof
  • Theorem 5
  • proof
  • ...and 2 more