SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions
Yuda Song, Zehao Sun, Xuanwu Yin
TL;DR
We address the latency of diffusion models by jointly miniaturizing model components and reducing sampling to a single step. The SDXS framework compresses the VAE and U-Net via distillation and introduces Segmented Score Distillation together with feature matching to obtain a true one-step generator, achieving real-time inference (about 100 FPS at 512×512 and 30 FPS at 1024×1024 on a single GPU). The approach extends to image-conditioned generation through a distilled ControlNet and LoRA fine-tuning, enabling efficient, controllable image-to-image translation. This offers a scalable path to deploying diffusion-based generation on edge devices and in low-resource settings while maintaining competitive quality and controllability.
Abstract
Recent advancements in diffusion models have positioned them at the forefront of image generation. Despite their superior performance, diffusion models are not without drawbacks; they are characterized by complex architectures and substantial computational demands, resulting in significant latency due to their iterative sampling process. To mitigate these limitations, we introduce a dual approach involving model miniaturization and a reduction in sampling steps, aimed at significantly decreasing model latency. Our methodology leverages knowledge distillation to streamline the U-Net and image decoder architectures, and introduces an innovative one-step diffusion model (DM) training technique that utilizes feature matching and score distillation. We present two models, SDXS-512 and SDXS-1024, achieving inference speeds of approximately 100 FPS (30x faster than SD v1.5) and 30 FPS (60x faster than SDXL) on a single GPU, respectively. Moreover, our training approach offers promising applications in image-conditioned control, facilitating efficient image-to-image translation.
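To make the quoted throughput figures concrete, the short sketch below converts them into per-image latency and into the baseline throughput that the stated speedups imply. It assumes that "Nx faster" means a ratio of frames per second on the same GPU with one image generated at a time; the helper names `latency_ms` and `implied_baseline_fps` are hypothetical, and the baseline numbers are derived from the abstract's figures rather than measured.

```python
def latency_ms(fps: float) -> float:
    """Per-image latency in milliseconds for a given throughput in FPS."""
    return 1000.0 / fps


def implied_baseline_fps(fps: float, speedup: float) -> float:
    """Baseline throughput implied by a speedup factor (assumed to be an FPS ratio)."""
    return fps / speedup


# (model, reported FPS, reported speedup, baseline it is compared against)
reported = [
    ("SDXS-512", 100.0, 30.0, "SD v1.5"),
    ("SDXS-1024", 30.0, 60.0, "SDXL"),
]

for model, fps, speedup, baseline in reported:
    base_fps = implied_baseline_fps(fps, speedup)
    print(
        f"{model}: {latency_ms(fps):.1f} ms/image; "
        f"implied {baseline}: {base_fps:.2f} FPS, {latency_ms(base_fps):.0f} ms/image"
    )
```

Under these assumptions, SDXS-512 takes about 10 ms per image against roughly 300 ms for the implied SD v1.5 baseline, and SDXS-1024 takes about 33 ms against roughly 2 s for the implied SDXL baseline.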
