SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation

Junsong Chen, Shuchen Xue, Yuyang Zhao, Jincheng Yu, Sayak Paul, Junyu Chen, Han Cai, Song Han, Enze Xie

TL;DR

SANA-Sprint introduces a training-free transformation to convert a pre-trained Flow Matching model into a TrigFlow-based continuous-time consistency framework and couples it with latent adversarial distillation to achieve ultra-fast, high-quality 1024×1024 T2I generation in 1–4 steps. By stabilizing training with dense time embeddings and QK-normalization and incorporating LADD with an optional max-time weighting, the method achieves state-of-the-art FID/GenEval (7.59/0.74) while delivering orders-of-magnitude faster inference (0.1s on H100, 0.25s with ControlNet). Real-time interactive generation is enabled via ControlNet integration, supporting instant visual feedback. The work positions SANA-Sprint as a practical, open-source platform for AI-powered consumer applications, offering a robust speed-quality frontier and human-in-the-loop capabilities.
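To illustrate why QK-normalization can stabilize training, here is a minimal sketch in plain Python. It assumes the normalization is an RMS norm without learnable scale applied to queries and keys before the dot product; the function names `rms_norm` and `attention_logit` are hypothetical and are not taken from the SANA-Sprint code. With normalization, the attention logit no longer grows with the magnitude of the activations.

```python
import math

def rms_norm(vec, eps=1e-6):
    # Divide by the root-mean-square of the entries (no learnable scale; an assumption).
    mean_sq = sum(v * v for v in vec) / len(vec)
    return [v / math.sqrt(mean_sq + eps) for v in vec]

def attention_logit(q, k, qk_norm=True):
    # Scaled dot-product logit for one query/key pair, optionally with QK-normalization.
    if qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.75, 1.0]
# Simulate activations that have blown up by a factor of 1000.
big_q = [1000.0 * v for v in q]
big_k = [1000.0 * v for v in k]

plain_small = attention_logit(q, k, qk_norm=False)
plain_big = attention_logit(big_q, big_k, qk_norm=False)
normed_small = attention_logit(q, k)
normed_big = attention_logit(big_q, big_k)

print("without QK-norm:", plain_small, plain_big)
print("with QK-norm:   ", normed_small, normed_big)
```

Without normalization the logit scales with the product of the input magnitudes; with it, the logit is bounded in absolute value by the square root of the head dimension, whatever the input scale.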

Abstract

This paper presents SANA-Sprint, an efficient diffusion model for ultra-fast text-to-image (T2I) generation. SANA-Sprint is built on a pre-trained foundation model and augmented with hybrid distillation, dramatically reducing inference steps from 20 to 1-4. We introduce three key innovations: (1) We propose a training-free approach that transforms a pre-trained flow-matching model for continuous-time consistency distillation (sCM), eliminating costly training from scratch and achieving high training efficiency. Our hybrid distillation strategy combines sCM with latent adversarial distillation (LADD): sCM ensures alignment with the teacher model, while LADD enhances single-step generation fidelity. (2) SANA-Sprint is a unified step-adaptive model that achieves high-quality generation in 1-4 steps, eliminating step-specific training and improving efficiency. (3) We integrate ControlNet with SANA-Sprint for real-time interactive image generation, enabling instant visual feedback for user interaction. SANA-Sprint establishes a new Pareto frontier in speed-quality tradeoffs, achieving state-of-the-art performance with 7.59 FID and 0.74 GenEval in only 1 step - outperforming FLUX-schnell (7.94 FID / 0.71 GenEval) while being 10x faster (0.1s vs 1.1s on H100). It also achieves 0.1s (T2I) and 0.25s (ControlNet) latency for 1024 x 1024 images on H100, and 0.31s (T2I) on an RTX 4090, showcasing its exceptional efficiency and potential for AI-powered consumer applications (AIPC). Code and pre-trained models will be open-sourced.

Paper Structure

This paper contains 37 sections, 1 theorem, 27 equations, 11 figures, 8 tables, 2 algorithms.

Key Result

Proposition 3.1

Given noisy data $\frac{\boldsymbol{x}_{t,\texttt{Trig}}}{\sigma_d}$ under the TrigFlow noise schedule, a flow matching model can denoise it via $\boldsymbol{v_{\theta}}(\boldsymbol{x}_{t, \texttt{FM}}, t_{\texttt{FM}}, \boldsymbol{y})$, where [...]. Given $\boldsymbol{v_{\theta}}(\boldsymbol{x}_{t, \texttt{FM}}, t_{\texttt{FM}}, \boldsymbol{y})$, the best estimator for the TrigFlow model [...]
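The statement above is truncated, so the paper's exact formulas are not reproduced here. As a hedged illustration of the general idea of mapping between noise schedules, the sketch below assumes the commonly used definitions $\boldsymbol{x}_{t,\texttt{Trig}} = \cos(t)\,\boldsymbol{x}_0 + \sin(t)\,\boldsymbol{z}$ with $t \in [0, \pi/2]$ and $\boldsymbol{x}_{t,\texttt{FM}} = (1 - t)\,\boldsymbol{x}_0 + t\,\boldsymbol{z}$ with $t \in [0, 1]$, takes $\sigma_d = 1$, and matches the two by signal-to-noise ratio. The helper names are hypothetical.

```python
import math
import random

def trig_to_fm_time(t_trig):
    # Flow-matching time with the same signal-to-noise ratio as TrigFlow time t_trig.
    s, c = math.sin(t_trig), math.cos(t_trig)
    return s / (s + c)

def trig_to_fm_sample(x_trig, t_trig):
    # Rescale a TrigFlow sample so it lies on the linear interpolation path.
    scale = math.sin(t_trig) + math.cos(t_trig)
    return [v / scale for v in x_trig]

# Numerical check on a random data/noise pair (sigma_d = 1 is an assumption).
rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(4)]
z = [rng.gauss(0.0, 1.0) for _ in range(4)]
t_trig = 0.7

x_trig = [math.cos(t_trig) * a + math.sin(t_trig) * b for a, b in zip(x0, z)]
t_fm = trig_to_fm_time(t_trig)

# Build the flow-matching sample directly and via the mapping, then compare.
x_fm_direct = [(1.0 - t_fm) * a + t_fm * b for a, b in zip(x0, z)]
x_fm_mapped = trig_to_fm_sample(x_trig, t_trig)
max_err = max(abs(a - b) for a, b in zip(x_fm_direct, x_fm_mapped))
print("t_fm =", t_fm, "max abs error =", max_err)
```

Under these assumptions the rescaled TrigFlow sample coincides with the flow-matching sample at the mapped time, which is why a pre-trained flow matching network can be evaluated on TrigFlow inputs without retraining. Converting the network's velocity output back into a TrigFlow prediction is a separate step that this sketch does not cover.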

Figures (11)

  • Figure 1: (a) SANA-Sprint accelerates inference for generating 1024 $\times$ 1024 images, reducing latency from FLUX-Schnell's 1.94 seconds to only 0.03 seconds. This represents a 64$\times$ improvement over the current state-of-the-art step-distilled model, FLUX-Schnell, as measured with a batch size of 1 on an NVIDIA A100 GPU. The ratio is calculated from Transformer latency. (b) Our model also uses GPU memory efficiently during training, with a lower memory cost than other distillation methods. GPU memory is measured using the official code with 1024 $\times$ 1024 images on a single A100 GPU.
  • Figure 2: Training paradigm of SANA-Sprint. We use the student model for synthetic data generation ($\hat{x_0}$) and $\text{JVP}$ calculation, and the teacher model for computing the velocity ($\mathrm{d}x/\mathrm{d}t$) and for providing the features used in the GAN loss. This allows us to train sCM and GAN together with only one trainable model, purely in the latent space. Details of the training objective and the TrigFlow transformation are given in the sCM loss and GAN generator loss equations and in the section on the transformation.
  • Figure 3: Efficient Distillation via QK Normalization, Dense Timestep Embedding, and Training-free Schedule Transformation. (a) We compare gradient norms and visualizations with and without QK Normalization, showing its stabilizing effect. (b) Gradient norm curves for two timestep scales (0$\sim$1 vs. 0$\sim$1000) highlight the impact on stability and quality. (c) PCA-based similarity analysis of timestep embeddings. (d) Image results after 5,000 iterations of fine-tuning with (left) and without (right) the proposed schedule transformation.
  • Figure 4: Visual comparison between SANA-Sprint and selected competing methods at different numbers of inference steps. † indicates that distinct models are required for different numbers of inference steps; the time below each method name is the 4-step latency measured on an A100 GPU. SANA-Sprint produces images with superior realism and text alignment at every step count, with the fastest speed.
  • Figure 5: Visual comparison between SANA-Sprint at different numbers of inference steps and the teacher model SANA. SANA-Sprint generates high-quality images in one or two steps, and image quality improves further as the number of steps increases.
  • ...and 6 more figures
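
As a quick sanity check on the speedups quoted in the abstract and in Figure 1, the following sketch recomputes the ratios from the reported latencies. The `speedup` helper is hypothetical; the latency values are the rounded numbers stated in the text, so the ratios are approximate.

```python
def speedup(baseline_seconds, ours_seconds):
    # Ratio of baseline latency to our latency; values above 1 mean we are faster.
    return baseline_seconds / ours_seconds

# Abstract: FLUX-schnell 1.1 s vs SANA-Sprint 0.1 s on H100 (1 step).
h100_ratio = speedup(1.1, 0.1)

# Figure 1: FLUX-Schnell 1.94 s vs SANA-Sprint 0.03 s on A100 (Transformer latency).
a100_ratio = speedup(1.94, 0.03)

print(f"H100 end-to-end: about {h100_ratio:.1f}x")
print(f"A100 Transformer-only: about {a100_ratio:.1f}x")
```

The first ratio is consistent with the "10x faster" claim in the abstract and the second with the 64$\times$ figure in the caption; the two differ because they use different GPUs and measure different things (the abstract gives no latency breakdown, while the caption measures Transformer latency only).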

Theorems & Definitions (3)

  • Remark
  • Proposition 3.1
  • proof