SCott: Accelerating Diffusion Models with Stochastic Consistency Distillation
Hongjian Liu, Qingsong Xie, TianXiang Ye, Zhijie Deng, Chen Chen, Shixiang Tang, Xueyang Fu, Haonan Lu, Zheng-jun Zha
TL;DR
SCott introduces Stochastic Consistency Distillation, a framework that merges consistency distillation with stochastic differential equation solvers to drastically accelerate text-to-image generation. By controlling noise, employing multi-step sampling, and incorporating adversarial refinement via a LoRA-based discriminator with text and time conditioning, SCott achieves state-of-the-art quality in 2 steps (FID 21.9 on MSCOCO-2017 5K with SD1.5) and improved diversity, with further gains as steps increase. The authors provide a theoretical convergence justification for CD with SDE solvers and validate the approach through extensive experiments, ablations, and qualitative analyses, including comparisons to Dreamshaper-based setups. Overall, SCott offers a scalable, efficient pathway to high-resolution image generation with few steps, leveraging stochasticity to strengthen the teacher and enhance sample diversity and fidelity.
Abstract
The iterative sampling procedure employed by diffusion models (DMs) often leads to significant inference latency. To address this, we propose Stochastic Consistency Distillation (SCott) to enable accelerated text-to-image generation, where high-quality and diverse generations can be achieved within just 2-4 sampling steps. In contrast to vanilla consistency distillation (CD), which distills the ordinary differential equation (ODE) solver-based sampling process of a pre-trained teacher model into a student, SCott explores the possibility and validates the efficacy of integrating stochastic differential equation (SDE) solvers into CD to fully unleash the potential of the teacher. SCott is augmented with elaborate strategies to control the noise strength and sampling process of the SDE solver. An adversarial loss is further incorporated to strengthen the consistency constraints when only a few sampling steps are used. Empirically, on the MSCOCO-2017 5K dataset with a Stable Diffusion-V1.5 teacher, SCott achieves an FID of 21.9 with 2 sampling steps, surpassing the 1-step InstaFlow (23.4) and the 4-step UFOGen (22.1). Moreover, SCott can yield more diverse samples than other consistency models for high-resolution image generation, with up to 16% improvement in a qualified metric.
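To make the contrast between ODE-based and SDE-based teacher sampling concrete, the sketch below uses the well-known DDIM update, whose `eta` parameter interpolates between a deterministic (ODE-like) step at `eta = 0` and a stochastic step at `eta > 0`. This is only an illustration of what "controlling the noise strength" of a solver can mean; it is an assumption that this parameterization resembles the one SCott uses, and the function name `ddim_step` and its arguments are hypothetical rather than taken from the paper.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev, eta, rng):
    """One reverse step from timestep t to an earlier timestep (DDIM update).

    Illustrative only, not the SCott solver. eta = 0 gives the deterministic
    (ODE-like) update; 0 < eta <= 1 injects fresh Gaussian noise.
    """
    # Clean sample implied by the current noisy sample and the noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Standard deviation of the injected noise, scaled by eta.
    sigma = (
        eta
        * np.sqrt((1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t))
        * np.sqrt(1.0 - alpha_bar_t / alpha_bar_prev)
    )
    # Weight on the predicted noise direction, reduced to make room for sigma.
    dir_coeff = np.sqrt(1.0 - alpha_bar_prev - sigma**2)
    noise = rng.standard_normal(x_t.shape)
    return np.sqrt(alpha_bar_prev) * x0_pred + dir_coeff * eps_pred + sigma * noise

# Toy usage: the "noise prediction" is exact here, so no network is needed.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
alpha_bar_t, alpha_bar_prev = 0.5, 0.8
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

ode_like = ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev, eta=0.0, rng=rng)
sde_like = ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev, eta=1.0, rng=rng)
```

With `eta = 0` the result does not depend on the random generator, so repeated calls from the same `x_t` follow one trajectory; with `eta > 0` each call lands on a different sample, which is the kind of stochasticity an SDE-based teacher exposes to the student.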
