SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions

Yuda Song; Zehao Sun; Xuanwu Yin

TL;DR

We reduce the latency of diffusion models by combining model distillation with one-step training. The SDXS framework compresses the VAE and U-Net via distillation and introduces Segmented Score Distillation together with feature matching to obtain a one-step generator, achieving real-time inference (about 100 FPS at 512×512 and 30 FPS at 1024×1024 on a single GPU). The approach extends to image-conditioned generation through a distilled ControlNet and LoRA fine-tuning, enabling efficient image-to-image translation. This offers a path toward deploying diffusion-based generation on edge devices and in low-resource settings while maintaining competitive quality and controllability.

Abstract

Recent advancements in diffusion models have positioned them at the forefront of image generation. Despite their superior performance, diffusion models are not without drawbacks; they are characterized by complex architectures and substantial computational demands, resulting in significant latency due to their iterative sampling process. To mitigate these limitations, we introduce a dual approach involving model miniaturization and a reduction in sampling steps, aimed at significantly decreasing model latency. Our methodology leverages knowledge distillation to streamline the U-Net and image decoder architectures, and introduces an innovative one-step DM training technique that utilizes feature matching and score distillation. We present two models, SDXS-512 and SDXS-1024, achieving inference speeds of approximately 100 FPS (30x faster than SD v1.5) and 30 FPS (60x faster than SDXL) on a single GPU, respectively. Moreover, our training approach offers promising applications in image-conditioned control, facilitating efficient image-to-image translation.
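The throughput figures in the abstract can be turned into per-image latencies and implied baseline speeds with simple arithmetic. The sketch below is only a back-of-the-envelope check: the function name `implied_baseline` is hypothetical, the inputs are the approximate FPS and speedup values quoted above, and the derived baseline numbers are inferred rather than reported measurements.

```python
def implied_baseline(fps: float, speedup: float) -> tuple[float, float]:
    """Return (per-image latency in ms, implied baseline FPS).

    fps     -- approximate throughput quoted in the abstract
    speedup -- quoted speedup factor over the baseline model
    """
    latency_ms = 1000.0 / fps      # time for one image at the quoted throughput
    baseline_fps = fps / speedup   # throughput the baseline would need to have
    return latency_ms, baseline_fps


# Approximate values taken from the abstract (assumed exact for this sketch).
sdxs_512 = implied_baseline(fps=100.0, speedup=30.0)   # vs. SD v1.5
sdxs_1024 = implied_baseline(fps=30.0, speedup=60.0)   # vs. SDXL

print(f"SDXS-512:  {sdxs_512[0]:.1f} ms/image, implied SD v1.5 rate {sdxs_512[1]:.2f} FPS")
print(f"SDXS-1024: {sdxs_1024[0]:.1f} ms/image, implied SDXL rate {sdxs_1024[1]:.2f} FPS")
```

Under these assumptions, SDXS-512 takes roughly 10 ms per image and SDXS-1024 roughly 33 ms, while the implied SDXL rate of about 0.5 FPS corresponds to about 2 seconds per image.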

Paper Structure (18 sections, 14 equations, 9 figures, 3 tables, 1 algorithm)

This paper contains 18 sections, 14 equations, 9 figures, 3 tables, 1 algorithm.

Introduction
Preliminary
Diffusion Models
Diff-Instruct
Method
Architecture Optimizations
VAE.
U-Net.
ControlNet.
One-Step Training
Feature Matching Warmup.
Segmented Score Distillation.
LoRA.
ControlNet.
Experiment
...and 3 more sections

Figures (9)

Figure 1: Assuming the image generation time is limited to 1 second, then SDXL can only use 16 NFEs to produce a slightly blurry image, while SDXS-1024 can generate 30 clear images. Besides, our proposed method can also train ControlNet.
Figure 2: Network architecture distillation, including image decoder, U-Net and ControlNet.
Figure 3: The proposed one-step U-Net training strategy based on feature matching and score distillation. The dashed lines indicate the gradient backpropagation.
Figure 4: The proposed LoRA training strategy based on feature matching and score distillation. The dashed lines indicate the gradient backpropagation.
Figure 5: The proposed one-step ControlNet training strategy based on feature matching and score distillation. The dashed lines indicate the gradient backpropagation.
...and 4 more figures
