A 28.6 mJ/iter Stable Diffusion Processor for Text-to-Image Generation with Patch Similarity-based Sparsity Augmentation and Text-based Mixed-Precision

Jiwon Choi; Wooyoung Jo; Seongyon Hong; Beomseok Kwon; Wonhoon Park; Hoi-Jun Yoo

TL;DR

This work addresses the energy and memory bottlenecks of running Stable Diffusion on mobile devices by targeting two block types: self-attention, which is dominated by external memory access (EMA), and the compute-heavy feed-forward network (FFN). It introduces a combined hardware-software design with three parts: patch similarity-based sparsity augmentation (PSSA) with a patch similarity-based XOR unit (PSXU) to compress self-attention scores, text-based important pixel spotting (TIPS) to choose per-pixel precision in the FFN from cross-attention relevance, and a dual-mode bit-slice core (DBSC) for efficient mixed-precision CNN/transformer processing. The resulting 28 nm processor reaches 3.84 TOPS peak throughput at 225.6 mW average power and 28.6 mJ/iteration on MS-COCO while maintaining image quality (CLIP loss < 0.002, FID 0.16). PSSA cuts the EMA energy of the self-attention score by 60.3 %, which gives a 37.8 % reduction in total EMA energy. Together, these techniques offer a practical path to energy-efficient text-to-image generation on mobile devices.
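The idea behind TIPS, picking a precision per pixel from how strongly that pixel relates to the text prompt, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the use of the maximum cross-attention weight as the relevance measure, and the threshold value are hypothetical and are not taken from the paper.

```python
def spot_important_pixels(cross_attention, threshold):
    """Flag each pixel as important (True) or not (False).

    cross_attention: one row per pixel, each row holding that pixel's
    attention weights over the text tokens (hypothetical layout).
    threshold: hypothetical cut-off on the relevance score.
    """
    flags = []
    for token_weights in cross_attention:
        # Assumed relevance measure: the strongest link to any text token.
        relevance = max(token_weights)
        flags.append(relevance >= threshold)
    return flags


def low_precision_fraction(flags):
    """Fraction of pixels that would be sent to the low-precision path."""
    return flags.count(False) / len(flags)


# Toy cross-attention map: 4 pixels, 3 text tokens.
scores = [
    [0.70, 0.20, 0.10],  # strongly tied to token 0
    [0.34, 0.33, 0.33],  # no dominant token
    [0.10, 0.85, 0.05],  # strongly tied to token 1
    [0.40, 0.30, 0.30],  # weakly tied to token 0
]
flags = spot_important_pixels(scores, threshold=0.5)
print(flags, low_precision_fraction(flags))
```

In this toy input, the two pixels without a dominant text token fall below the threshold and would be processed with low-precision activations, while the other two keep high precision.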

Abstract

This paper presents an energy-efficient stable diffusion processor for text-to-image generation. While stable diffusion has attracted attention for its high-quality image synthesis, its inherent characteristics hinder deployment on mobile platforms. The proposed processor achieves high throughput and energy efficiency through three key features: 1) Patch similarity-based sparsity augmentation (PSSA), which reduces the external memory access (EMA) energy of the self-attention score by 60.3 %, leading to a 37.8 % reduction in total EMA energy. 2) Text-based important pixel spotting (TIPS), which allows 44.8 % of the FFN layer workload to be processed with low-precision activation. 3) A dual-mode bit-slice core (DBSC) architecture, which enhances energy efficiency in FFN layers by 43.0 %. The proposed processor is implemented in 28 nm CMOS technology and achieves 3.84 TOPS peak throughput with 225.6 mW average power consumption. In sum, the processor achieves highly energy-efficient text-to-image generation at 28.6 mJ/iteration on the MS-COCO dataset.
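To make the first feature more concrete, the following Python sketch shows one way XOR between similar neighbouring rows can make a bitmap sparser before compression. It is an illustration under stated assumptions, not the paper's PSSA or PSXU design: it assumes the bitmap marks the nonzero positions of the self-attention score, that consecutive rows come from similar patches, and the function names are hypothetical.

```python
def xor_augment(bitmap_rows):
    """Keep the first row; replace each later row by its XOR with the
    previous original row. Similar rows turn into mostly-zero deltas."""
    augmented = [bitmap_rows[0][:]]
    for prev, cur in zip(bitmap_rows, bitmap_rows[1:]):
        augmented.append([a ^ b for a, b in zip(prev, cur)])
    return augmented


def xor_restore(augmented):
    """Invert xor_augment by XOR-ing each delta with the restored row before it."""
    rows = [augmented[0][:]]
    for delta in augmented[1:]:
        rows.append([a ^ b for a, b in zip(rows[-1], delta)])
    return rows


def count_ones(rows):
    """Number of set bits, a rough proxy for how much must be stored."""
    return sum(sum(row) for row in rows)


# Toy bitmap: three rows from (assumed) similar patches.
bitmap = [
    [1, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 0, 0, 1],
]
augmented = xor_augment(bitmap)
print(count_ones(bitmap), count_ones(augmented))
```

The transform is lossless because XOR is its own inverse, so `xor_restore(augmented)` returns the original bitmap. How much sparsity is gained depends entirely on how similar the neighbouring rows are.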

Paper Structure (10 sections, 11 figures, 1 table)

Introduction
Overall Architecture
Effective Compression of Self-attention Score
Self-attention Score Bitmap Sparsity Augmentation
Patch Similarity-based XOR Unit (PSXU)
Text-based Mixed-precision Processing
Text-based Important Pixel Spotting
Dual-mode Bit-slice Core (DBSC) Architecture
Implementation Results
Conclusion

Figures (11)

Figure 1: (a) Overview of stable diffusion model. (b) Two main challenges.
Figure 2: Overall architecture.
Figure 3: (a) Patch-wise similarity in SAS. (b) Compression flow of SAS with proposed patch similarity-based sparsity augmentation.
Figure 4: Proposed patch similarity-based XOR unit (PSXU).
Figure 5: Performance of proposed PSSA.
...and 6 more figures
