A-SDM: Accelerating Stable Diffusion through Model Assembly and Feature Inheritance Strategies

Jinchao Zhu, Yuxuan Wang, Siyuan Pan, Pengfei Wan, Di Zhang, Gao Huang

TL;DR

This study designs a model assembly strategy that reconstructs a lightweight model while preserving performance and ensuring semantic stability through distillation, and proposes a feature inheritance strategy that accelerates inference by skipping local computations at the block, layer, or unit level within the network structure.

Abstract

The Stable Diffusion Model (SDM) is a prevalent and effective model for text-to-image (T2I) and image-to-image (I2I) generation. Despite various attempts at sampler optimization, model distillation, and network quantization, these approaches typically maintain the original network architecture. The extensive parameter scale and substantial computational demands have limited research into adjusting the model architecture. This study focuses on reducing redundant computation in SDM and optimizes the model through both tuning and tuning-free methods. 1) For the tuning method, we first design a model assembly strategy to reconstruct a lightweight model while preserving performance through distillation. Second, to mitigate performance loss due to pruning, we incorporate multi-expert conditional convolution (ME-CondConv) into compressed UNets to enhance network performance by increasing capacity without sacrificing speed. Third, we validate the effectiveness of the multi-UNet switching method for improving network speed. 2) For the tuning-free method, we propose a feature inheritance strategy to accelerate inference by skipping local computations at the block, layer, or unit level within the network structure. We also examine multiple sampling modes for feature inheritance at the time-step level. Experiments demonstrate that both the proposed tuning and tuning-free methods can improve the speed and performance of the SDM. The lightweight model reconstructed by the model assembly strategy increases generation speed by 22.4%, while the feature inheritance strategy enhances the SDM generation speed by 40.0%.
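
The feature inheritance idea described in the abstract (skipping a block's internal computation at some time steps and reusing an earlier result) can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the blocks are scalar functions standing in for UNet blocks, each block is treated as residual (`h = h + f(h)`), the sampling mode is a fixed "recompute every `full_every` steps" schedule, and the names `denoise_with_inheritance`, `skippable`, and `full_every` are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

# A block is a (name, function) pair; the function is a toy stand-in for a UNet block.
Block = Tuple[str, Callable[[float], float]]


def denoise_with_inheritance(
    blocks: List[Block],
    skippable: Set[str],
    num_steps: int,
    full_every: int,
    x: float,
) -> Tuple[float, Dict[str, int]]:
    """Run a toy denoising loop that reuses cached block residuals on non-full steps."""
    cache: Dict[str, float] = {}
    calls: Dict[str, int] = {name: 0 for name, _ in blocks}
    for step in range(num_steps):
        # Assumed sampling mode: every `full_every`-th step recomputes all blocks.
        full_step = step % full_every == 0
        h = x
        for name, fn in blocks:
            if name in skippable and not full_step and name in cache:
                # Inherit the residual cached at the last step that computed this block.
                residual = cache[name]
            else:
                residual = fn(h)
                cache[name] = residual
                calls[name] += 1
            h = h + residual
        x = x - 0.1 * h  # toy update rule, not a real sampler
    return x, calls


toy_blocks: List[Block] = [
    ("shallow", lambda h: 0.5 * h),
    ("deep", lambda h: 0.25 * h),
]
_, calls = denoise_with_inheritance(
    toy_blocks, {"deep"}, num_steps=10, full_every=2, x=1.0
)
print(calls)  # {'shallow': 10, 'deep': 5}
```

In this toy setting the "deep" block is evaluated on only half of the steps, which is where the saving would come from; which blocks, layers, or units are safe to skip, and at which time steps, is what the paper's experiments examine.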

Paper Structure (26 sections, 4 equations, 16 figures, 5 tables)

Figures (16)

  • Figure 1: (a) and (b) illustrate the parameters and computational requirements of SDM v1 and v2 [2023-ICMLW-bksdm]. (c) and (d) present an analysis of the parameters and latency (iPhone 14 Pro, ms) of the cross-attention (CA) and ResNet blocks within the UNet of SDM [2023-nips-snapfusion].
  • Figure 2: (a) demonstrates the function of the different components within UNet. The encoder part (blue box) is primarily responsible for understanding the input image while the decoder part (red box) handles the expressive reconstruction of the image. The shallow layers (yellow block) focus on detail optimization while the deep layers (green block) concentrate on semantic optimization. (b) presents a conceptual approach to network structure optimization. The blue squares indicate the layers within each block of the UNet that are retained. In the distillation method, the white dashed squares represent the blocks that have been removed. In the untrained feature inheritance strategy, these dashed squares denote the layers that skip internal calculations.
  • Figure 3: Macro model assembly process. The first step is to compress the original model into a compressed model through distillation. The second step combines the original model's middle part with the compressed model's two sides to form a reconstructed model. The original model is then used as a teacher to distill the reconstructed model further.
  • Figure 4: Specific SDM model assembly process. Step 1: Distillation of the student compressed model. Step 2: The compressed UNet is merged with the original UNet to obtain the reconstructed UNet, which is then distilled again. Deep parts of the reconstructed UNet are frozen during training to ensure stable semantic generation. Step 3: Explore different combinations and distill the reconstructed UNet again. L-task indicates the supervision of the denoising task. L-KD represents distillation supervision at the UNet output position. L-featKD denotes the distillation supervision of each block output. Loss settings follow [2023-ICMLW-bksdm].
  • Figure 5: Base UNet (a), Small UNet (b), and Tiny UNet (c) are the three compressed UNet structures proposed in [2023-ICMLW-bksdm]. The white squares represent the layers that have been pruned. (d) illustrates ME-CondConv, which is adopted to expand the capacity of compressed UNets.
  • ...and 11 more figures
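
The three supervision terms named in the Figure 4 caption (L-task, L-KD, L-featKD) can be combined as in the sketch below. The caption only says what each term supervises; the use of mean squared error for every term, the weighted-sum form, and the weights `w_kd` and `w_feat` are assumptions made here for illustration, and the actual loss settings are those of 2023-ICMLW-bksdm. Plain Python lists stand in for tensors.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    noise_target: Sequence[float],
    student_out: Sequence[float],
    teacher_out: Sequence[float],
    student_feats: Sequence[Sequence[float]],
    teacher_feats: Sequence[Sequence[float]],
    w_kd: float = 1.0,    # hypothetical weight, not taken from the paper
    w_feat: float = 1.0,  # hypothetical weight, not taken from the paper
) -> float:
    # L-task: the student UNet's prediction against the denoising target.
    l_task = mse(student_out, noise_target)
    # L-KD: the student's UNet output against the teacher's UNet output.
    l_kd = mse(student_out, teacher_out)
    # L-featKD: per-block outputs of the student against the teacher's.
    l_feat = sum(mse(s, t) for s, t in zip(student_feats, teacher_feats))
    return l_task + w_kd * l_kd + w_feat * l_feat


loss = distillation_loss(
    noise_target=[0.0, 0.0],
    student_out=[1.0, 1.0],
    teacher_out=[0.0, 0.0],
    student_feats=[[1.0, 0.0]],
    teacher_feats=[[0.0, 0.0]],
)
print(loss)  # 2.5
```

Freezing the deep parts of the reconstructed UNet, as the caption describes for Step 2, would correspond to excluding those parameters from the optimizer; that part is not modeled here.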