SCott: Accelerating Diffusion Models with Stochastic Consistency Distillation

Hongjian Liu, Qingsong Xie, TianXiang Ye, Zhijie Deng, Chen Chen, Shixiang Tang, Xueyang Fu, Haonan Lu, Zheng-jun Zha

TL;DR

SCott (Stochastic Consistency Distillation) integrates stochastic differential equation (SDE) solvers into consistency distillation (CD) to accelerate text-to-image generation. It controls the noise strength of the SDE solver, uses multi-step SDE sampling to build a stronger teacher, and adds an adversarial loss through a LoRA-based discriminator conditioned on text and time. With a Stable Diffusion-V1.5 teacher, SCott reaches an FID of 21.9 on MSCOCO-2017 5K using 2 sampling steps, with further gains as the number of steps increases, and it yields more diverse samples than other consistency models. The authors provide a theoretical convergence justification for CD with SDE solvers and validate the approach through experiments, ablations, and qualitative analyses, including comparisons to Dreamshaper-based setups.

Abstract

The iterative sampling procedure employed by diffusion models (DMs) often leads to significant inference latency. To address this, we propose Stochastic Consistency Distillation (SCott) to enable accelerated text-to-image generation, where high-quality and diverse generations can be achieved within just 2-4 sampling steps. In contrast to vanilla consistency distillation (CD), which distills the ordinary differential equation (ODE) solver-based sampling process of a pre-trained teacher model into a student, SCott explores the possibility and validates the efficacy of integrating stochastic differential equation (SDE) solvers into CD to fully unleash the potential of the teacher. SCott is augmented with elaborate strategies to control the noise strength and sampling process of the SDE solver. An adversarial loss is further incorporated to strengthen the consistency constraints when only a few sampling steps are used. Empirically, on the MSCOCO-2017 5K dataset with a Stable Diffusion-V1.5 teacher, SCott achieves an FID of 21.9 with 2 sampling steps, surpassing that of the 1-step InstaFlow (23.4) and the 4-step UFOGen (22.1). Moreover, SCott can yield more diverse samples than other consistency models for high-resolution image generation, with up to 16% improvement in a qualified metric.

Paper Structure

This paper contains 33 sections, 3 theorems, 21 equations, 16 figures, 9 tables, 2 algorithms.

Key Result

Theorem 1

Let $\Delta t := \max\limits_{n\in [1,N]} |t_{n+1} - t_{n}|$, where $t\in [\tau, T]$. Assume ${\bm{f}}_{\theta}(\cdot, \cdot)$ is Lipschitz in ${\bm{x}}$ with constant $L_1$. Denote by ${\bm{f}}(\cdot, \cdot)$ the consistency function of the SDE defined in eq: re-sde. Assume the SDE solver $\Phi_{SDE}$ ...
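
To make the quantity $\Delta t$ in Theorem 1 concrete, the following minimal sketch computes the largest gap between adjacent time points of a discretization of $[\tau, T]$. The helper name `max_step_size` and the values of `tau`, `T`, `N` and both grids are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np


def max_step_size(timesteps):
    """Return Delta t = max_n |t_{n+1} - t_n| for a 1-D grid of time points."""
    t = np.asarray(timesteps, dtype=float)
    if t.ndim != 1 or t.size < 2:
        raise ValueError("need a 1-D grid with at least two time points")
    # np.diff gives t_{n+1} - t_n for every adjacent pair of grid points.
    return float(np.max(np.abs(np.diff(t))))


# Hypothetical values for illustration only; the source does not fix them here.
tau, T, N = 0.0, 1.0, 4
uniform_grid = np.linspace(tau, T, N + 1)          # t_1, ..., t_{N+1}, equal gaps
uneven_grid = np.array([0.0, 0.1, 0.3, 0.6, 1.0])  # same endpoints, growing gaps

print("uniform grid Delta t:", max_step_size(uniform_grid))
print("uneven grid Delta t: ", max_step_size(uneven_grid))
```

With the same endpoints and the same number of intervals, the uneven grid has a larger $\Delta t$ than the uniform one, because $\Delta t$ is determined by the single widest interval rather than by the average spacing.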

Figures (16)

  • Figure 1: $512 \times 512$ resolution images generated by SCott using 2 sampling steps. SCott is trained based on Realistic-Vision-v51.
  • Figure 2: Overview of SCott. SCott distills a pre-trained teacher DM into a student one for accelerated sampling. Compared to the vanilla consistency distillation approach, we introduce a multi-step SDE solver to establish a stronger and more versatile teacher. We train the student model with CD loss using SDE solvers. Additionally, we include an adversarial learning loss to correct student output, boosting the sample quality when only a few sampling steps are used. Note that we omit the EMA operation for the teacher for brevity.
  • Figure 3: Multi-step SDE solver sampling
  • Figure 4: Comparison of stochastic CD (based on SDE solvers) and vanilla CD (based on ODE solvers) on a synthetic generation task. The other experimental settings for the two cases are identical.
  • Figure 5: Qualitative comparisons of SCott against competing methods and DDIM, DPM++ baselines. All models are initialized by SD1.5.
  • ...and 11 more figures
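
The captions above refer to a multi-step SDE solver and to a comparison between SDE-based and ODE-based distillation. As a rough illustration of the general idea only, the sketch below integrates a toy SDE, $dx = -x\,dt + \sigma\,dW$, with several Euler-Maruyama substeps. The toy drift, the function name `multistep_euler_maruyama`, and all parameter values are assumptions; this is not the solver, noise schedule, or teacher model used in the paper. Setting `noise_scale` to zero turns the update into a deterministic Euler step for the ODE $dx = -x\,dt$.

```python
import numpy as np


def multistep_euler_maruyama(x, t_start, t_end, n_substeps, noise_scale, rng):
    """Integrate the toy SDE dx = -x dt + noise_scale dW from t_start to t_end.

    The interval is split into n_substeps equal substeps. With noise_scale = 0
    the update is a plain deterministic Euler step for the ODE dx = -x dt.
    """
    dt = (t_end - t_start) / n_substeps
    x = np.array(x, dtype=float)  # copy so the caller's array is not modified
    for _ in range(n_substeps):
        drift = -x  # toy drift term, a stand-in for a learned model
        noise = rng.standard_normal(x.shape)
        x = x + drift * dt + noise_scale * np.sqrt(abs(dt)) * noise
    return x


rng = np.random.default_rng(0)
x0 = np.ones(1000)  # 1000 identical starting points

# Same start, same number of substeps; only the noise strength differs.
ode_like = multistep_euler_maruyama(x0, 0.0, 1.0, 4, 0.0, rng)
sde_like = multistep_euler_maruyama(x0, 0.0, 1.0, 4, 0.5, rng)

print("spread without noise:", ode_like.std())
print("spread with noise:   ", sde_like.std())
```

In this toy setting the deterministic run sends every identical starting point to the same endpoint, while the stochastic run spreads them out, which is the kind of behavior a noise-strength parameter controls.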

Theorems & Definitions (6)

  • Theorem 1
  • proof
  • Theorem 2
  • Lemma 3: Proof in song2023consistency
  • proof: Proof of Lemma \ref{lemma1} in song2023consistency
  • proof: Proof of \ref{proof sde}