SigDLA: A Deep Learning Accelerator Extension for Signal Processing
Fangfa Fu, Wenyu Zhang, Zesong Jiang, Zhiyu Zhu, Guoyu Li, Bing Yang, Cheng Liu, Liyi Xiao, Jinxiang Wang, Huawei Li, Xiaowei Li
TL;DR
The paper addresses running signal processing alongside deep learning on IoT processors. Such processors typically build deep learning on top of digital signal processors (DSPs), which lack native hardware support for it, so the more compute-intensive workload runs inefficiently. SigDLA takes the opposite approach. It extends a conventional deep learning accelerator (DLA) with two additions: a programmable data shuffling fabric, inserted between the input buffer and the compute array, and a variable-bitwidth compute path. The key contributions are the data shuffling fabric, which converts irregular signal-processing access patterns (such as FFT butterflies) into regular tensor operations; a serial 4-bit-based compute array that also supports 8- and 16-bit operations; and instructions that drive shuffling and padding. Together, these map FFT, FIR, DCT, and DWT onto convolution-like workloads. Compared with an embedded ARM processor with customized DSP instructions, a DSP processor, and an independent DSP-DLA architecture, SigDLA achieves average speedups of $4.4\times$, $1.4\times$, and $1.52\times$ and average energy reductions of $4.82\times$, $3.27\times$, and $2.15\times$, respectively, at the cost of $17\%$ more chip area than the original DLA. This demonstrates effective on-chip co-processing of signal processing and deep learning for IoT workloads.
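The claim that a 4-bit compute array can also serve 8- and 16-bit operations rests on a standard digit decomposition: split each operand into 4-bit digits, multiply every pair of digits, and add the partial products shifted by their combined weight. The Python sketch below illustrates only that general arithmetic principle. It is an assumption about how such a composition can work, not SigDLA's actual datapath, and the helper names (`split_nibbles`, `mul_4bit`, `mul_composed`) are hypothetical. The most significant digit carries the two's-complement sign, so signed operands need no separate correction step.

```python
# Illustrative sketch of nibble-based multiplication; NOT SigDLA's actual datapath.
# All function names here are hypothetical.


def split_nibbles(x: int, width: int) -> list[int]:
    """Split a signed `width`-bit integer into 4-bit digits, least significant first.

    Lower digits are unsigned (0..15); the top digit keeps the sign (-8..7),
    so x == sum(d << (4 * i) for i, d in enumerate(digits)).
    """
    if width % 4 != 0:
        raise ValueError("width must be a multiple of 4")
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    if not lo <= x <= hi:
        raise ValueError(f"{x} does not fit in {width} signed bits")
    digits = []
    for _ in range(width // 4 - 1):
        digits.append(x & 0xF)  # unsigned low nibble
        x >>= 4                 # arithmetic shift preserves the sign
    digits.append(x)            # remaining signed top nibble in [-8, 7]
    return digits


def mul_4bit(a: int, b: int) -> int:
    """Model one 4-bit x 4-bit multiplier lane (signed or unsigned nibbles)."""
    assert -8 <= a <= 15 and -8 <= b <= 15
    return a * b


def mul_composed(x: int, y: int, width: int) -> int:
    """Multiply two signed `width`-bit integers using only 4-bit partial products."""
    xs, ys = split_nibbles(x, width), split_nibbles(y, width)
    acc = 0
    for i, xd in enumerate(xs):
        for j, yd in enumerate(ys):
            # Each partial product is weighted by 16**(i + j).
            acc += mul_4bit(xd, yd) << (4 * (i + j))
    return acc


if __name__ == "__main__":
    print(mul_composed(-128, 127, 8))      # -16256
    print(mul_composed(12345, -321, 16))   # -3962745
```

An 8-bit multiply needs 4 digit products and a 16-bit multiply needs 16, so a serial design that issues one 4-bit product per lane per cycle trades latency for width. How SigDLA actually schedules these products is left as an assumption in this sketch.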
Abstract
Deep learning and signal processing are closely coupled in many IoT scenarios that bring intelligence to things, such as anomaly detection. Many IoT processors use digital signal processors (DSPs) for signal processing and build deep learning frameworks on top of them. Yet deep learning, which is usually much more compute-intensive than signal processing, runs with limited efficiency on DSPs because they lack native hardware support for it. We therefore take the opposite approach and enable signal processing on top of a classical deep learning accelerator (DLA). Observing that irregular data patterns, such as the butterfly operations in FFT, are the major barrier to deploying signal processing on DLAs, we propose a programmable data shuffling fabric and insert it between the input buffer and the computing array of the DLA, so that irregular data is reorganized and the processing becomes regular. With this online data shuffling, the proposed architecture, SigDLA, can adapt to various signal processing tasks without affecting deep learning processing. Moreover, we build a reconfigurable computing array to meet the different data-width requirements of signal processing and deep learning. In our experiments, SigDLA achieves average speedups of $4.4\times$, $1.4\times$, and $1.52\times$, and average energy reductions of $4.82\times$, $3.27\times$, and $2.15\times$, compared with an embedded ARM processor with customized DSP instructions, a DSP processor, and an independent DSP-DLA architecture, respectively, while requiring only 17% more chip area than the original DLA.
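To make the idea of reorganizing irregular data concrete, the NumPy sketch below implements a radix-2 decimation-in-time FFT in which every stage first gathers its operands with an index permutation, so that butterfly partners sit at the same position in the two halves of the array. The arithmetic then reduces to one uniform elementwise step, `u + w * v` and `u - w * v`, over contiguous vectors, while all stride-dependent irregularity lives in the permutation. The function names are hypothetical, and the per-stage gather/scatter is an assumption chosen for clarity; it is not a description of how SigDLA's shuffling fabric is actually programmed.

```python
# Illustrative sketch only: NOT SigDLA's shuffling fabric or instruction set.
# Function names are hypothetical.
import numpy as np


def bit_reverse_permutation(n: int) -> np.ndarray:
    """Indices that reorder a length-n (power of two) array into bit-reversed order."""
    bits = n.bit_length() - 1
    idx = np.arange(n)
    rev = np.zeros(n, dtype=int)
    for b in range(bits):
        rev |= ((idx >> b) & 1) << (bits - 1 - b)
    return rev


def stage_shuffle(n: int, m: int) -> np.ndarray:
    """Gather order for a stage with block size m: all 'top' butterfly inputs
    first, then their 'bottom' partners in the same order."""
    half = m // 2
    top = [blk + k for blk in range(0, n, m) for k in range(half)]
    bot = [t + half for t in top]
    return np.array(top + bot)


def fft_with_shuffles(x) -> np.ndarray:
    x = np.asarray(x, dtype=complex)
    n = x.size
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    x = x[bit_reverse_permutation(n)]
    m = 2
    while m <= n:
        perm = stage_shuffle(n, m)
        y = x[perm]                               # shuffle: irregular gather
        u, v = y[: n // 2], y[n // 2 :]
        w = np.tile(np.exp(-2j * np.pi * np.arange(m // 2) / m), n // m)
        y = np.concatenate([u + w * v, u - w * v])  # regular elementwise butterflies
        x = np.empty_like(y)
        x[perm] = y                               # inverse shuffle: scatter back
        m *= 2
    return x


if __name__ == "__main__":
    sig = np.arange(8, dtype=float)
    print(np.allclose(fft_with_shuffles(sig), np.fft.fft(sig)))  # True
```

The shuffles only move data and never change values, which mirrors the point that reorganization alone can make the processing regular. Comparing the output against `np.fft.fft` is a quick way to confirm that the permutations are correct.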
