Implementation and Evaluation of Stable Diffusion on a General-Purpose CGLA Accelerator
Takuto Ando, Yu Eto, Yasuhiko Nakashima
TL;DR
The paper addresses energy-efficient inference for diffusion-based image generation by porting the core Stable Diffusion dot-product kernels to IMAX3, a general-purpose CGRA. It builds on the GGML-based stable-diffusion.cpp framework and reuses quantized dot-product kernels from prior LLM work, while introducing specialized dataflows and new instructions to support Q8_0 and Q3_K quantization. Although the current FPGA prototype, with limited offload, yields only modest end-to-end gains compared to CPU and GPU, ASIC-based projections (up to 840 MHz) indicate substantial improvements in latency and energy efficiency, with the power-delay product (PDP) potentially surpassing the GPU's in some configurations. The study provides concrete architectural guidelines for future IMAX designs and supports the feasibility of energy-efficient, on-device, multi-modal AI accelerators built on a versatile CGLA platform.
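To make the Q8_0 dot-product kernel concrete, the following is a minimal reference sketch in Python. It is not the IMAX3 dataflow or the stable-diffusion.cpp source. It assumes the standard GGML Q8_0 layout of 32-element blocks, each holding one scale and 32 signed 8-bit values. The function names `quantize_q8_0` and `dot_q8_0` are hypothetical, and the scale is kept as a Python float rather than the half-precision value GGML stores.

```python
QK8_0 = 32  # block size of the GGML Q8_0 format


def quantize_q8_0(values):
    """Quantize floats into a list of (scale, int8_values) blocks of 32.

    Illustrative only: Python's round() uses round-half-to-even, whereas
    a C implementation would typically use roundf(); the scale is kept as
    a float here instead of being stored in fp16.
    """
    if len(values) % QK8_0 != 0:
        raise ValueError("length must be a multiple of 32")
    blocks = []
    for start in range(0, len(values), QK8_0):
        chunk = values[start:start + QK8_0]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0
        if scale == 0.0:
            quants = [0] * QK8_0
        else:
            # Clamp to the symmetric int8 range used by Q8_0.
            quants = [max(-127, min(127, round(v / scale))) for v in chunk]
        blocks.append((scale, quants))
    return blocks


def dot_q8_0(blocks_a, blocks_b):
    """Dot product of two Q8_0-quantized vectors.

    Per block: an integer multiply-accumulate over the 32 int8 pairs,
    followed by one floating-point multiply by the two block scales.
    """
    if len(blocks_a) != len(blocks_b):
        raise ValueError("vectors must have the same number of blocks")
    total = 0.0
    for (scale_a, quants_a), (scale_b, quants_b) in zip(blocks_a, blocks_b):
        int_sum = sum(x * y for x, y in zip(quants_a, quants_b))
        total += scale_a * scale_b * int_sum
    return total
```

The inner integer multiply-accumulate dominates the work, which is why kernels of this shape are natural candidates for offloading to an accelerator, while the per-block scale multiply remains a small floating-point step.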
Abstract
This paper presents the first implementation and in-depth evaluation of the primary computational kernels from the stable-diffusion.cpp image generation framework on IMAX3, a general-purpose Coarse-Grained Reconfigurable Array (CGRA) accelerator. We designed IMAX3 as a versatile computational platform, and this work assesses its capabilities by executing a demanding image generation workload. We evaluate its performance on a current Field-Programmable Gate Array (FPGA) prototype to establish a baseline and project its potential for a future Application-Specific Integrated Circuit (ASIC) implementation. Our results demonstrate that, despite its general-purpose architecture, IMAX3 achieves promising performance and power efficiency, particularly in its projected ASIC form. This work provides concrete guidelines for future IMAX architectural designs and establishes a foundation for developing next-generation, AI-specialized Coarse-Grained Linear Array (CGLA) accelerators by refining this versatile platform. Ultimately, this achievement contributes to the realization of energy-efficient, on-device, multi-modal AI platforms.
