
An Efficient Sparse Hardware Accelerator for Spike-Driven Transformer

Zhengke Li, Wendong Mao, Siyu Zhang, Qiwei Dong, Zhongfeng Wang

TL;DR

Transformers deployed at the edge incur high compute and energy costs. The authors propose a sparse hardware accelerator for Spike-driven Transformer that encodes spike positions to skip non-spike values and enables efficient dual-spike SDSA, maxpooling, and linear computations. Core contributions include the Spike Encoding Array (SEA) with Encoded Spike SRAM (ESS), the Spike Maxpooling Unit (SMU), the Spike Mask-Add Module (SMAM), and the Spike Linear Unit (SLU). On an FPGA implementation, the design achieves up to 307.2 GSOP/s and 25.6 GSOP/W, with throughput improvements of up to 13.24× and energy-efficiency improvements of up to 1.33× over prior SNN accelerators, demonstrating practical edge deployment of Spike-driven Transformers on CIFAR-10 with quantized weights and activations.

Abstract

Recently, large models, such as Vision Transformer and BERT, have garnered significant attention due to their exceptional performance. However, their extensive computational requirements lead to considerable power and hardware resource consumption. Brain-inspired computing, characterized by its spike-driven methods, has emerged as a promising approach for low-power hardware implementation. In this paper, we propose an efficient sparse hardware accelerator for Spike-driven Transformer. We first design a novel encoding method that encodes the position information of valid activations and skips non-spike values. This method enables us to use encoded spikes for executing the calculations of linear, maxpooling, and spike-driven self-attention. Compared with the single-spike-input design of conventional SNN accelerators, which primarily focus on convolution-based spiking computations, the specialized module for spike-driven self-attention is unique in its ability to handle dual spike inputs. By exclusively utilizing activated spikes, our design fully exploits the sparsity of Spike-driven Transformer, which diminishes redundant operations, lowers power consumption, and minimizes computational latency. Experimental results indicate that, compared to existing SNN accelerators, our design achieves up to 13.24$\times$ and 1.33$\times$ improvements in terms of throughput and energy efficiency, respectively.
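
The position-encoding idea in the abstract can be illustrated in software. The sketch below is not the authors' hardware design; it only shows, under assumed names (`encode_spikes`, `sparse_spike_linear`, `dense_linear` are all hypothetical), why storing the positions of spikes lets a linear layer skip every non-spike input: with binary activations, the matrix-vector product reduces to adding up the weight rows at the spike positions.

```python
def encode_spikes(spikes):
    """Return the positions of the 1-valued entries in a binary spike vector."""
    return [i for i, s in enumerate(spikes) if s == 1]


def sparse_spike_linear(positions, weights):
    """Accumulate only the weight rows selected by the spike positions.

    weights[i][j] connects input i to output j (an assumed layout).
    No multiplications are needed because every spike has value 1.
    """
    n_out = len(weights[0])
    out = [0] * n_out
    for pos in positions:
        row = weights[pos]
        for j in range(n_out):
            out[j] += row[j]
    return out


def dense_linear(spikes, weights):
    """Reference dense computation, used only to check the sparse path."""
    n_out = len(weights[0])
    return [sum(spikes[i] * weights[i][j] for i in range(len(spikes)))
            for j in range(n_out)]


# Illustrative data (made up for this sketch): 6 inputs, 2 outputs.
spikes = [0, 1, 0, 0, 1, 0]
weights = [[1, -2], [3, 4], [5, 6], [-7, 8], [2, -1], [9, 9]]

positions = encode_spikes(spikes)
result = sparse_spike_linear(positions, weights)
```

Here only two of the six weight rows are read, so the work scales with the number of spikes rather than with the input width, which is the kind of saving the abstract attributes to sparsity.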

Paper Structure

This paper contains 11 sections, 3 equations, 6 figures, and 1 table.

Figures (6)

  • Figure 1: Overall architecture of our hardware accelerator.
  • Figure 2: The architecture of the proposed SEA. When the output of the adder exceeds the firing threshold $V_{th}$, the current address $Pos[t][0]$ is stored in the ESS. $Temp[t][0]$ and $Spa[t][0]$ represent temporal and spatial input for SEU[0] in timestep $t$.
  • Figure 3: Illustration of the calculation process of the SMU. The red and black boxes represent two adjacent kernels. "$or$" represents the logical OR.
  • Figure 4: The details of SMAM. (a) Data paths of SMAM. (b) The logic of token-wise accumulation and fire determination. (c) The logic of masking.
  • Figure 5: Illustration of the SLU. (a) The calculation process of the SLU with an input matrix $\mathbf{X}\in \mathbb{R}^{3 \times 2 \times 2}$. (b) The architecture of the SLU. The Saturation-Truncation Module prevents the value from wrapping around to the negative side or the positive side, ensuring the result fits within the specified bit width.
  • ...and 1 more figure
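
Two of the captions above lend themselves to a small software illustration. The Figure 3 caption relies on logical OR for maxpooling, which works because the maximum of binary spikes equals their OR. The Figure 5 caption describes a Saturation-Truncation Module that keeps results from wrapping around. The sketch below is a rough model under stated assumptions, not the authors' circuit: the function names, the 2×2 kernel with stride 2, and the 8-bit signed width are all hypothetical choices made for the example, and only the saturation (clamping) behavior is modeled.

```python
def spike_maxpool_or(fmap, kernel=2, stride=2):
    """Maxpool a binary spike map by OR-ing the values in each window.

    kernel and stride are assumed values; the captions do not state them.
    """
    h, w = len(fmap), len(fmap[0])
    out = []
    for r in range(0, h - kernel + 1, stride):
        row = []
        for c in range(0, w - kernel + 1, stride):
            acc = 0
            for dr in range(kernel):
                for dc in range(kernel):
                    acc |= fmap[r + dr][c + dc]  # OR equals max for 0/1 values
            row.append(acc)
        out.append(row)
    return out


def saturate(value, bits=8):
    """Clamp a signed integer to the range of a `bits`-wide two's-complement word."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


# Illustrative 4x4 spike map (made up for this sketch).
fmap = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]
pooled = spike_maxpool_or(fmap)
```

With the assumed 8-bit width, `saturate(200)` returns 127 and `saturate(-200)` returns -128 instead of wrapping to the opposite sign, which is the failure mode the caption says the module prevents.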