Energy-Efficient Hardware Acceleration of Whisper ASR on a CGLA

Takuto Ando, Yu Eto, Ayumu Takeuchi, Yasuhiko Nakashima

TL;DR

The paper tackles the energy-efficiency challenge of running Whisper ASR on edge devices by implementing Whisper's core dot-product kernel on the IMAX CGLA accelerator. Using hardware/software co-design, the authors evaluate an FPGA prototype and project a 28 nm ASIC whose power-delay product (PDP) is lower than that of the Jetson AGX Orin and RTX 4090, especially with Q8_0 quantization. The work introduces FP16 and Q8_0 kernels, data-handling optimizations, and a 32 KB LMM configuration chosen to maximize kernel coverage while minimizing static power, which makes execution on IMAX compute-bound. The study presents the CGRA-like IMAX as a viable, energy-efficient platform for ASR at the edge and outlines directions for scaling to larger Whisper models.
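
To make the Q8_0 dot-product kernel mentioned above more concrete, here is a minimal NumPy sketch. It assumes Q8_0 follows the ggml convention (blocks of 32 int8 values sharing one FP16 scale); the function names `quantize_q8_0` and `dot_q8_0` are hypothetical, and this is not the IMAX kernel from the paper, only an illustration of the arithmetic such a kernel would have to perform.

```python
import numpy as np

QK = 32  # values per block (assumed ggml Q8_0 block size)


def quantize_q8_0(x):
    """Split x into blocks of QK values; return (fp16 scales, int8 quants).

    Raises ValueError if len(x) is not a multiple of QK.
    """
    blocks = np.asarray(x, dtype=np.float32).reshape(-1, QK)
    amax = np.abs(blocks).max(axis=1)
    scale = amax / np.float32(127.0)  # one scale per block
    # Avoid division by zero for all-zero blocks.
    safe = np.where(scale > 0, scale, np.float32(1.0))
    q = np.clip(np.round(blocks / safe[:, None]), -127, 127).astype(np.int8)
    return scale.astype(np.float16), q


def dot_q8_0(dx, qx, dy, qy):
    """Dot product of two Q8_0 vectors given their scales and quants."""
    # Integer multiply-accumulate inside each block ...
    isum = (qx.astype(np.int32) * qy.astype(np.int32)).sum(axis=1)
    # ... then one floating-point rescale per block.
    return float(np.sum(isum * dx.astype(np.float32) * dy.astype(np.float32)))


# Usage: values chosen so that quantization is exact (block scale = 1.0).
x = np.arange(QK, dtype=np.float32)
x[0] = 127.0
y = np.full(QK, 127.0, dtype=np.float32)
result = dot_q8_0(*quantize_q8_0(x), *quantize_q8_0(y))
```

The inner loop is pure integer multiply-accumulate, with a single floating-point rescale per block. That property is a plausible reason why an 8-bit format would suit a PE array, though the paper's actual mapping onto IMAX may differ.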

Abstract

The rise of generative AI for tasks like Automatic Speech Recognition (ASR) has created a critical energy consumption challenge. While ASICs offer high efficiency, they lack the programmability to adapt to evolving algorithms. To address this trade-off, we implement and evaluate Whisper's core computational kernel on IMAX, a general-purpose Coarse-Grained Linear Array (CGLA) accelerator. To our knowledge, this is the first work to execute a Whisper kernel on a CGRA and compare its performance against CPUs and GPUs. Using hardware/software co-design, we evaluate our system via an FPGA prototype and project performance for a 28 nm ASIC. Our results demonstrate superior energy efficiency: for the Q8_0 model, the projected ASIC is 1.90x more energy-efficient than the NVIDIA Jetson AGX Orin and 9.83x more energy-efficient than an NVIDIA RTX 4090. This work positions CGLA as a promising platform for sustainable ASR on power-constrained edge devices.
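
As a reading aid for the efficiency figures above, the standard definition of the power-delay product can be written out. The symbols below are assumptions introduced here, not notation from the paper, and it is also an assumption that the reported factors are PDP ratios.

```latex
% P_avg: average power drawn during inference (W)
% T_E2E: end-to-end latency (s)
\mathrm{PDP} = P_{\mathrm{avg}} \cdot T_{\mathrm{E2E}} \quad [\mathrm{J}]

% Relative energy efficiency of IMAX against a baseline device:
\mathrm{gain} = \frac{\mathrm{PDP}_{\mathrm{baseline}}}{\mathrm{PDP}_{\mathrm{IMAX}}}
```

Under this reading, a gain of 1.90 against the Jetson AGX Orin would mean the Jetson spends 1.90 times the energy of the projected ASIC on the same workload, even if the Jetson finishes sooner.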

Paper Structure

This paper contains 16 sections, 1 equation, 7 figures, 4 tables.

Figures (7)

  • Figure 1: The Whisper ASR processing flow. The highlighted kernels are the computational stages accelerated by IMAX.
  • Figure 2: High-level overview of the IMAX3 architecture, implemented on a multi-FPGA platform with four AMD Versal VPK180 devices.
  • Figure 3: Internal structure of an IMAX lane, featuring interleaved PEs and LMMs to improve dataflow.
  • Figure 4: E2E latency comparison by device. The IMAX (28) demonstrates a speedup over the CPU, while GPU remains the fastest.
  • Figure 5: PDP performance comparison by device (lower is better). The IMAX (28) is more energy-efficient than other platforms.
  • ...and 2 more figures