A Low-Power Streaming Speech Enhancement Accelerator For Edge Devices

Ci-Hao Wu, Tian-Sheuan Chang

TL;DR

This work targets real-time streaming speech enhancement on edge devices by addressing the inefficiencies of transformer-based models. It introduces the Time-Frequency Transformer Neural Network (TFTNN) and a hardware accelerator through co-designed model compression and hardware optimization, including domain-aware and streaming-aware pruning, cross-domain masking, BN-based transformers, and softmax-free attention. The TFTNN achieves approximately 93.9% model-size reduction and 94.9% complexity reduction with minimal performance loss, and runs on a 40 nm silicon design consuming only 8.08 mW to process frames in real time. The resulting 1-D MAC-based accelerator stores intermediate feature maps on-chip, minimizes memory I/O, and demonstrates strong area efficiency and practicality for scalable edge-enabled speech processing tasks.

Abstract

Transformer-based speech enhancement models yield impressive results. However, their heterogeneous and complex structure restricts model compression potential, resulting in greater complexity and reduced hardware efficiency. Additionally, these models are not tailored for streaming and low-power applications. To address these challenges, this paper proposes a low-power streaming speech enhancement accelerator through model and hardware optimization. The proposed high-performance model is optimized for hardware execution through the co-design of model compression and the target application, which reduces the model size by 93.9% using the proposed domain-aware and streaming-aware pruning techniques. The required latency is further reduced with batch normalization-based transformers. Additionally, we employ softmax-free attention, complemented by an extra batch normalization, which facilitates a simpler hardware design. The tailored hardware accommodates these diverse computing patterns by breaking them down into element-wise multiplication and accumulation (MAC). This is achieved through a 1-D processing array with configurable SRAM addressing, thereby minimizing hardware complexity and simplifying zero skipping. Using the TSMC 40 nm CMOS process, the final implementation requires merely 207.8K gates and 53.75 KB of SRAM. It consumes only 8.08 mW for real-time inference at 62.5 MHz.
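The idea of breaking a layer down into element-wise MAC operations with zero skipping can be illustrated with a small software sketch. This is not the authors' hardware or dataflow: the function names `mac_1d` and `matvec_as_macs`, the toy weight matrix, and the choice to skip on zero weights (as pruning would produce) are all assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple


def mac_1d(weights: Sequence[float],
           activations: Sequence[float],
           skip_zeros: bool = True) -> Tuple[float, int]:
    """Accumulate sum(w * a) one element at a time.

    Returns the accumulated value and the number of MACs actually issued.
    Hypothetical helper: models a single accumulator, not a real PE array.
    """
    acc = 0.0
    issued = 0
    for w, a in zip(weights, activations):
        if skip_zeros and w == 0.0:
            continue  # pruned weight: no multiply-accumulate is issued
        acc += w * a
        issued += 1
    return acc, issued


def matvec_as_macs(matrix: Sequence[Sequence[float]],
                   vector: Sequence[float],
                   skip_zeros: bool = True) -> Tuple[List[float], int]:
    """Express a matrix-vector product as a sequence of 1-D MAC passes."""
    outputs: List[float] = []
    total_macs = 0
    for row in matrix:
        y, n = mac_1d(row, vector, skip_zeros)
        outputs.append(y)
        total_macs += n
    return outputs, total_macs


if __name__ == "__main__":
    # Toy pruned weights (assumed values): 5 of the 8 entries are zero.
    W = [[0.5, 0.0, -1.0, 0.0],
         [0.0, 0.0, 2.0, 0.0]]
    x = [1.0, 2.0, 3.0, 4.0]

    y_sparse, macs_sparse = matvec_as_macs(W, x, skip_zeros=True)
    y_dense, macs_dense = matvec_as_macs(W, x, skip_zeros=False)
    print(y_sparse, macs_sparse)  # [-2.5, 6.0] 3
    print(y_dense, macs_dense)    # [-2.5, 6.0] 8
```

In this toy case the skipped and unskipped passes give the same outputs, while the number of issued MACs drops from 8 to 3. That is the kind of saving a pruned model could expose to a MAC-based datapath, under the assumptions stated above.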

Paper Structure

This paper contains 32 sections, 2 equations, 19 figures, 7 tables.

Figures (19)

  • Figure 1: The parameter and complexity distribution of the two-stage transformer neural network (TSTNN) [TSTNN_2021]. The computational complexity is calculated with 8K samples per second.
  • Figure 2: (a) Dilated dense block, and (b) dilated residual block with channel splitting. Each block is a convolution block with kernel size k, stride s, and dilation rate d. Each convolution is followed by LN/PReLU in the dilated dense block, and by LN/ReLU in the dilated residual block.
  • Figure 3: Transformer (a) with full-band multi-head attention, and (b) without full-band multi-head attention.
  • Figure 4: (a) Original mask module and (b) modified mask module.
  • Figure 5: The weight distribution of PReLU in this model.
  • ...and 14 more figures