Open-Source Acceleration of Stable-Diffusion.cpp Deployable on All Devices

Jingxu Ng, Cheng Lv, Pu Zhao, Wei Niu, Juyi Lin, Minzhou Pan, Yun Liang, Yanzhi Wang

TL;DR

The paper addresses the high latency and memory demands of diffusion-based image generation by targeting the main bottleneck in stable-diffusion.cpp (Sdcpp): 2D convolution. It introduces a Winograd-based acceleration strategy, coupled with graph-level analysis to exploit locality and parallelism, scatter-store/gather-load optimizations, and dynamic multi-core scheduling, while expanding operator support and quantization for cross-device deployment. The reported gains are up to $2.76\times$ speedup on individual convolution layers and up to $4.79\times$ end-to-end acceleration for image generation on M1 Pro, with strong gains on M2 Max as well, across SDv1.4/v1.5/v2.1/SDXL/SDXL-Turbo. This work enables open-source, cross-device deployment of accelerated Stable Diffusion on Mac, Android, and AMD platforms, reducing latency and memory footprint while maintaining correct end-to-end results.

Abstract

Stable diffusion plays a crucial role in generating high-quality images. However, image generation is time-consuming and memory-intensive. Stable-diffusion.cpp (Sdcpp) has emerged as an efficient inference framework for accelerating diffusion models. Although it is lightweight, the current implementation of the ggml_conv_2d operator in Sdcpp is suboptimal, exhibiting both high inference latency and massive memory usage. In this work, we present an optimized version of Sdcpp that leverages the Winograd algorithm to accelerate 2D convolution, which is the primary bottleneck in the pipeline. By analyzing both dependent and independent computation graphs, we exploit the device's locality and parallelism to achieve substantial performance improvements. Our framework delivers correct end-to-end results across various stable diffusion models, including SDv1.4, v1.5, v2.1, SDXL, and SDXL-Turbo. Our evaluation results demonstrate a speedup of up to 2.76x for individual convolutional layers and an inference speedup of up to 4.79x for the overall image generation process, compared with the original Sdcpp on M1 Pro. Homepage: https://github.com/SealAILab/stable-diffusion-cpp
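
To make the Winograd idea mentioned in the abstract concrete, here is a minimal sketch of the textbook 1D case F(2,3), which produces two outputs of a 3-tap filter with four multiplications instead of six. It is an illustration only: the tile size, the 2D formulation, and the function names `direct_f23` and `winograd_f23` are assumptions made here, not details taken from the paper or from the Sdcpp code.

```python
def direct_f23(d, g):
    """Direct 3-tap correlation over a 4-element tile: 6 multiplications."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Winograd F(2,3): the same two outputs with 4 multiplications."""
    # Filter transform. For fixed weights this can be computed once
    # and reused for every input tile.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]

    # Input transform followed by the 4 element-wise products.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3

    # Output transform: recombine the products into the two outputs.
    return [m1 + m2 + m3, m2 - m3 - m4]


if __name__ == "__main__":
    tile = [1.0, 2.0, 3.0, 4.0]
    taps = [0.5, -1.0, 2.0]
    print(direct_f23(tile, taps))
    print(winograd_f23(tile, taps))
```

Both functions should print the same pair of values. The saving comes from trading multiplications for additions and from precomputing the filter transform; the 2D variants commonly used for 3x3 convolutions nest the same transforms along rows and columns.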

Paper Structure

This paper contains 7 sections, 1 figure.

Figures (1)

  • Figure 1: Visualization examples of the original Sdcpp and ours, with SDXL-Turbo model and 5 steps.