See Further When Clear: Curriculum Consistency Model
Yunpeng Liu, Boxiao Liu, Yi Zhang, Xingzhong Hou, Guanglu Song, Yu Liu, Haihang You
TL;DR
This work tackles instability in consistency-based generative modeling by introducing a PSNR-based Knowledge Discrepancy of the Curriculum ($KDC$) and an adaptive, multi-step distillation scheme that maintains balanced learning difficulty across timesteps. The Curriculum Consistency Model (CCM) dynamically adjusts learning targets and uses a multi-step teacher iteration to produce reliable distillation signals, yielding strong one-step FID scores on CIFAR-10 ($1.64$) and ImageNet 64×64 ($2.18$), and extending to large-scale diffusion (Stable Diffusion XL) and flow-matching (Stable Diffusion 3) pipelines with improved image-text alignment. A unified distillation loss combines KDC-driven targets with adversarial losses, enabling CCM to outperform prior CM variants across both diffusion and flow-matching settings. The results suggest CCM’s curriculum-aware approach improves sampling efficiency and sample quality, offering practical gains for high-resolution, text-conditioned synthesis and highlighting avenues for future dynamic curriculum thresholds and sampling strategies.
Abstract
Significant advances have been made in the sampling efficiency of diffusion models and flow matching models, driven by Consistency Distillation (CD), which trains a student model to mimic the output of a teacher model at a later timestep. However, we found that the learning complexity of the student model varies significantly across timesteps, leading to suboptimal performance in CD. To address this issue, we propose the Curriculum Consistency Model (CCM), which stabilizes and balances the learning complexity across timesteps. Specifically, we regard the distillation process at each timestep as a curriculum and introduce a metric based on Peak Signal-to-Noise Ratio (PSNR) to quantify the learning complexity of this curriculum. We then keep the learning complexity of the curriculum consistent across timesteps by having the teacher model iterate more steps when the noise intensity is low. Our method achieves competitive single-step sampling Fréchet Inception Distance (FID) scores of 1.64 on CIFAR-10 and 2.18 on ImageNet 64×64. Moreover, we have extended our method to large-scale text-to-image models and confirmed that it generalizes well to both diffusion models (Stable Diffusion XL) and flow matching models (Stable Diffusion 3). The generated samples demonstrate improved image-text alignment and semantic structure, since CCM enlarges the distillation step at large timesteps and reduces the accumulated error.
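The idea of letting the teacher iterate until a PSNR-based measure indicates a sufficiently hard target can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`psnr`, `pick_target`), the stopping rule (stop once PSNR drops to a threshold), the `data_range` value, and the toy models are all assumptions made for illustration only.

```python
import numpy as np

def psnr(a, b, data_range=2.0):
    """Peak Signal-to-Noise Ratio in dB; data_range=2.0 assumes values in [-1, 1]."""
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)

def pick_target(x_t, t, student, target_net, teacher_step, psnr_threshold, max_iters=50):
    """Advance the teacher from (x_t, t) until the target is 'hard enough'.

    Hypothetical rule: keep iterating while the PSNR between the student
    prediction at (x_t, t) and the target prediction at (x_u, u) stays above
    psnr_threshold, i.e. while the two predictions are still too similar.
    Returns the chosen (x_u, u) and the number of teacher iterations used.
    """
    pred = student(x_t, t)
    x_u, u, n = x_t, t, 0
    while u > 0 and n < max_iters:
        x_u, u = teacher_step(x_u, u)  # one teacher solver step toward t = 0
        n += 1
        if psnr(pred, target_net(x_u, u)) <= psnr_threshold:
            break
    return x_u, u, n

# Toy stand-ins (assumptions): identity networks and a teacher that shrinks
# the sample by 10% per step while moving the timestep down by 0.1.
def toy_net(x, t):
    return x

def toy_teacher_step(x, u):
    return 0.9 * x, max(u - 0.1, 0.0)

x_u, u, n_steps = pick_target(
    np.ones((4, 4)), 1.0, toy_net, toy_net, toy_teacher_step, psnr_threshold=18.0
)
print(n_steps, round(u, 2))
```

Under a rule of this kind, the number of teacher iterations is not fixed in advance: when a single teacher step barely changes the prediction, more steps are taken before the distillation target is chosen, which is the behaviour the abstract describes for low noise intensity.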
