Phased Consistency Models
Fu-Yun Wang, Zhaoyang Huang, Alexander William Bergman, Dazhong Shen, Peng Gao, Michael Lingelbach, Keqiang Sun, Weikang Bian, Guanglu Song, Yu Liu, Xiaogang Wang, Hongsheng Li
TL;DR
The paper addresses limitations of latent consistency models (LCMs) in high-resolution, text-conditioned generation by introducing Phased Consistency Models (PCMs), which partition the diffusion trajectory into sub-trajectories to enable deterministic multi-step sampling with improved controllability and efficiency. PCMs are paired with a guided distillation approach and an adversarial consistency loss that improve few-step generation quality while remaining compatible with classifier-free guidance (CFG) settings. PCMs outperform LCMs across 1–16 step settings, achieve 1-step results comparable to methods designed specifically for 1-step generation, and extend to few-step text-to-video generation. The code is open source. Overall, PCMs broaden the design space for fast, high-fidelity diffusion-based generation of images and video.
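The partitioning idea above can be illustrated with a small sketch. It assumes evenly spaced phase edges over an integer timestep range, which this summary does not specify, and the function names (`phase_edges`, `phase_start`, `sampling_schedule`) are hypothetical, not taken from the authors' code. The denoising network itself is not modeled; only the bookkeeping of sub-trajectories is shown.

```python
def phase_edges(num_train_steps: int, num_phases: int) -> list[int]:
    """Return num_phases + 1 edge timesteps from 0 to num_train_steps.

    Even spacing is an assumption made for illustration only.
    """
    return [i * num_train_steps // num_phases for i in range(num_phases + 1)]


def phase_start(t: int, edges: list[int]) -> int:
    """Return the lower edge of the sub-trajectory that contains timestep t.

    Each sub-trajectory covers the half-open interval (lo, hi].
    """
    for lo, hi in zip(edges, edges[1:]):
        if lo < t <= hi:
            return lo
    raise ValueError(f"timestep {t} lies outside ({edges[0]}, {edges[-1]}]")


def sampling_schedule(edges: list[int]) -> list[tuple[int, int]]:
    """List the deterministic hops, from each phase's upper edge to its lower edge."""
    upper = list(reversed(edges[1:]))
    lower = list(reversed(edges[:-1]))
    return list(zip(upper, lower))


edges = phase_edges(1000, 4)
schedule = sampling_schedule(edges)
```

With four phases over 1000 timesteps, `schedule` contains four hops, so a sampler built on it would take four deterministic steps, one per sub-trajectory.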
Abstract
Consistency Models (CMs) have made significant progress in accelerating generation with diffusion models. However, their application to high-resolution, text-conditioned image generation in the latent space remains unsatisfactory. In this paper, we identify three key flaws in the current design of Latent Consistency Models (LCMs). We investigate the reasons behind these limitations and propose Phased Consistency Models (PCMs), which generalize the design space and address the identified limitations. Our evaluations demonstrate that PCMs outperform LCMs across 1–16 step generation settings. Although PCMs are designed for multi-step refinement, their 1-step generation results are comparable to those of previous state-of-the-art methods designed specifically for 1-step generation. Furthermore, we show that the PCM methodology is versatile and applicable to video generation, enabling us to train a state-of-the-art few-step text-to-video generator. Our code is available at https://github.com/G-U-N/Phased-Consistency-Model.
