SpeedUpNet: A Plug-and-Play Adapter Network for Accelerating Text-to-Image Diffusion Models

Weilong Chai, DanDan Zheng, Jiajiong Cao, Zhiquan Chen, Changbao Wang, Chenguang Ma

TL;DR

SpeedUpNet (SUN) is a universal plug-in adapter for the cross-attention layers of diffusion models that accelerates text-to-image generation while preserving content fidelity and negative-prompt controllability. It learns the offset that a negative prompt induces relative to the positive prompt and applies Attention Normalization, so that a single forward pass approximates the output of classifier-free guidance (CFG); a Multi-Step Consistency (MSC) distillation loss keeps outputs stable across multi-step accelerated sampling. Trained once on base Stable Diffusion v1.5, SUN plugs into various fine-tuned SD models without further training, enabling 4-step inference with more than a 10× speedup over a 25-step DPM-Solver++ baseline and competitive or state-of-the-art FID/CLIP scores on the LAION-Aesthetic-6+ dataset. It also integrates with inpainting, image-to-image, and ControlNet, offering a practical, training-free path to accelerating stylized diffusion models with stable, controllable outputs.

Abstract

Text-to-image diffusion models (SD) exhibit significant advancements while requiring extensive computational resources. Existing acceleration methods usually require extensive training and are not universally applicable. LCM-LoRA, trainable once for diverse models, offers universality but rarely considers ensuring the consistency of generated content before and after acceleration. This paper proposes SpeedUpNet (SUN), an innovative acceleration module, to address the challenges of universality and consistency. Exploiting the role of cross-attention layers in U-Net for SD models, we introduce an adapter specifically designed for these layers, quantifying the offset in image generation caused by negative prompts relative to positive prompts. This learned offset demonstrates stability across a range of models, enhancing SUN's universality. To improve output consistency, we propose a Multi-Step Consistency (MSC) loss, which stabilizes the offset and ensures fidelity in accelerated content. Experiments on SD v1.5 show that SUN leads to an overall speedup of more than 10 times compared to the baseline 25-step DPM-solver++, and offers two extra advantages: (1) training-free integration into various fine-tuned Stable-Diffusion models and (2) state-of-the-art FIDs of the generated data set before and after acceleration guided by random combinations of positive and negative prompts. Code is available: https://williechai.github.io/speedup-plugin-for-stable-diffusions.github.io.
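
As a rough illustration of where the reported speedup comes from, the sketch below writes out the standard classifier-free guidance combination with a negative prompt and counts U-Net evaluations. It assumes the 25-step baseline runs CFG with two U-Net evaluations per step and that the accelerated sampler needs one evaluation per step for 4 steps; adapter overhead is ignored, and the function names are illustrative rather than taken from the paper's code.

```python
def cfg_combine(eps_pos, eps_neg, guidance_scale):
    """Standard CFG with a negative prompt: eps_neg + w * (eps_pos - eps_neg).

    The term (eps_pos - eps_neg) is the positive/negative offset that,
    per the abstract, the adapter is trained to account for.
    """
    return [n + guidance_scale * (p - n) for p, n in zip(eps_pos, eps_neg)]


def unet_evaluations(num_steps, uses_cfg):
    """Count U-Net forward passes; CFG needs two per sampling step (assumption)."""
    return num_steps * (2 if uses_cfg else 1)


baseline_evals = unet_evaluations(25, uses_cfg=True)      # 25 steps with CFG
accelerated_evals = unet_evaluations(4, uses_cfg=False)   # 4 steps, single pass each
evaluation_ratio = baseline_evals / accelerated_evals

print(baseline_evals, accelerated_evals, evaluation_ratio)
```

Under these assumptions the evaluation count drops from 50 to 4, which is in line with the "more than 10 times" figure once the extra cost of the adapter is taken into account.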

Paper Structure (24 sections, 14 equations, 9 figures, 4 tables, 1 algorithm)

Figures (9)

  • Figure 1: Visualization of the offset between positive and negative guidance. While fine-tuned SD models can generate images of very different styles, the subtraction of predictions guided by positive and negative text (the offset) is relatively consistent across different SD models.
  • Figure 2: The overall framework of the proposed SUN. The SUN adapter, which consists of several cross-attention (CA) blocks, is introduced to process and understand the negative prompt. Each CA block of SUN is placed side by side with the corresponding block of the original U-Net. Each block introduces a new K matrix and a new V matrix, while sharing Q with the original U-Net. An Attention Normalization technique is proposed for stability.
  • Figure 3: An illustration of Multi-Step Consistency (MSC). When distilling a faster student model, a teacher-student discrepancy exists and gradually accumulates, causing the content generated by the student to be inconsistent with the teacher's (from the same noise). Building on step distillation, MSC trains the student to stay close to the teacher's trajectory even when errors occur, thus ensuring consistency in multi-step sampling.
  • Figure 4: Generation comparisons with different SOTA methods on different numbers of diffusion steps. The proposed SUN can produce high-quality images with only a few steps. In addition, the proposed SUN achieves the highest consistency to the ground truth with only 4 steps.
  • Figure 5: The proposed SUN maintains the controllability of negative prompts while eliminating the need for CFG.
  • ...and 4 more figures
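
The adapter layout described in the Figure 2 caption (a parallel cross-attention branch with its own K and V matrices that shares Q with the original U-Net) can be sketched in plain Python as follows. This is a minimal single-query, single-head sketch and not the authors' implementation: the caption does not say how the two branches are merged or how Attention Normalization is defined, so the additive merge and the norm rescaling below are hypothetical stand-ins, and all function and parameter names are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vec):
    return [dot(row, vec) for row in matrix]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([scale * dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def l2_norm(vec):
    return math.sqrt(dot(vec, vec))


def adapter_cross_attention(query, pos_tokens, neg_tokens,
                            w_k, w_v, w_k_new, w_v_new):
    # Original branch: K and V projected from the positive-prompt tokens.
    base = attend(query,
                  [matvec(w_k, t) for t in pos_tokens],
                  [matvec(w_v, t) for t in pos_tokens])
    # Adapter branch: new K and V for the negative prompt, same shared query.
    extra = attend(query,
                   [matvec(w_k_new, t) for t in neg_tokens],
                   [matvec(w_v_new, t) for t in neg_tokens])
    # ASSUMPTION: merge the two branches by addition.
    combined = [b + e for b, e in zip(base, extra)]
    # ASSUMPTION: stand-in for Attention Normalization, rescaling the merged
    # output to the L2 norm of the original branch.
    norm = l2_norm(combined)
    if norm == 0.0:
        return combined
    factor = l2_norm(base) / norm
    return [c * factor for c in combined]


identity = [[1.0, 0.0], [0.0, 1.0]]
output = adapter_cross_attention(
    query=[1.0, 0.0],
    pos_tokens=[[1.0, 0.0], [0.0, 1.0]],
    neg_tokens=[[0.0, 1.0]],
    w_k=identity, w_v=identity, w_k_new=identity, w_v_new=identity,
)
print(output)
```

Sharing the query means the adapter adds only the extra K/V projections and one more attention computation per block, which matches the caption's description of a side-by-side branch rather than a second full U-Net pass.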