Implementation and Evaluation of Stable Diffusion on a General-Purpose CGLA Accelerator

Takuto Ando, Yu Eto, Yasuhiko Nakashima

TL;DR

The paper addresses energy-efficient inference for diffusion-based image generation by porting the core Stable Diffusion dot-product kernels to the general-purpose IMAX3 CGRA. It builds on the GGML-based stable-diffusion.cpp framework and reuses quantized dot-product kernels from prior LLM work, adding specialized dataflows and new instructions to support the Q8_0 and Q3_K quantization formats. On the current FPGA prototype, where only part of the workload is offloaded, end-to-end gains over CPU and GPU are modest. ASIC-based projections (up to $840~\mathrm{MHz}$) indicate substantial improvements in latency and energy efficiency, with the power-delay product (PDP) potentially surpassing that of a GPU in some configurations. The study provides concrete architectural guidelines for future IMAX designs and supports the feasibility of energy-efficient, on-device, multi-modal AI accelerators built on a versatile CGLA platform.
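
To make the Q8_0 dot-product kernel mentioned above more concrete, the following is a minimal NumPy reference sketch, not the IMAX3 dataflow and not the stable-diffusion.cpp code. It assumes the GGML-style Q8_0 layout of 32 signed 8-bit values per block sharing one half-precision scale; the function names `quantize_q8_0` and `dot_q8_0` are hypothetical and chosen only for illustration.

```python
import numpy as np

QK8_0 = 32  # assumed block size of the GGML-style Q8_0 format


def quantize_q8_0(x):
    """Quantize a float vector into Q8_0-style blocks (hypothetical helper).

    Returns (q, d): q is an int8 array of shape (n_blocks, 32) and d is a
    float16 array holding one scale per block.
    """
    x = np.asarray(x, dtype=np.float32)
    assert x.size % QK8_0 == 0, "length must be a multiple of the block size"
    blocks = x.reshape(-1, QK8_0)
    amax = np.abs(blocks).max(axis=1)
    d = (amax / 127.0).astype(np.float16)  # per-block scale, stored as fp16
    d32 = d.astype(np.float32)
    # Reciprocal scale, with all-zero blocks mapped to 0 instead of dividing by 0.
    inv = np.divide(1.0, d32, out=np.zeros_like(d32), where=d32 > 0)
    q = np.clip(np.rint(blocks * inv[:, None]), -127, 127).astype(np.int8)
    return q, d


def dot_q8_0(qa, da, qb, db):
    """Dot product of two Q8_0-quantized vectors (hypothetical helper).

    Each block contributes an integer multiply-accumulate over its 32
    elements, which is then scaled by the product of the two block scales.
    """
    isum = (qa.astype(np.int32) * qb.astype(np.int32)).sum(axis=1)
    return float(np.sum(isum * da.astype(np.float32) * db.astype(np.float32)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal(256).astype(np.float32)
    b = rng.standard_normal(256).astype(np.float32)
    qa, da = quantize_q8_0(a)
    qb, db = quantize_q8_0(b)
    print("float dot    :", float(a @ b))
    print("Q8_0 dot     :", dot_q8_0(qa, da, qb, db))
```

The integer multiply-accumulate inside each block followed by a single floating-point scaling per block is the part of the computation that a quantized kernel of this kind maps onto hardware; how IMAX3 schedules it is described in the paper itself (see Figure 3) and is not reproduced here.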

Abstract

This paper presents the first implementation and in-depth evaluation of the primary computational kernels from the stable-diffusion.cpp image generation framework on IMAX3, a general-purpose Coarse-Grained Reconfigurable Array (CGRA) accelerator. We designed IMAX3 as a versatile computational platform, and this work assesses its capabilities by executing a demanding image generation workload. We evaluate its performance on a current Field-Programmable Gate Array (FPGA) prototype to establish a baseline and project its potential for a future Application-Specific Integrated Circuit (ASIC) implementation. Our results demonstrate that, despite its general-purpose architecture, IMAX3 achieves promising performance and power efficiency, particularly in its projected ASIC form. This work provides concrete guidelines for future IMAX architectural designs and establishes a foundation for developing next-generation, AI-specialized Coarse-Grained Linear Array (CGLA) accelerators by refining this versatile platform. Ultimately, this achievement contributes to the realization of energy-efficient, on-device, multi-modal AI platforms.

Paper Structure

This paper contains 14 sections, 1 equation, 11 figures, 2 tables.

Figures (11)

  • Figure 1: The IMAX3 FPGA prototype, featuring a multi-board configuration with four AMD Versal VPK180 evaluation kits.
  • Figure 2: High-level overview of the IMAX3 system architecture, implemented on a multi-FPGA platform with four AMD Versal VPK180 devices.
  • Figure 3: Processing flow of Q8_0 kernel.
  • Figure 4: Processing flow of Q3_K kernel.
  • Figure 5: Generated images of Q3_K and Q8_0 models.
  • ...and 6 more figures