Hardware Efficient Accelerator for Spiking Transformer With Reconfigurable Parallel Time Step Computing

Bo-Yu Chen, Tian-Sheuan Chang

TL;DR

This work tackles the energy and latency challenges of Spiking Transformer models by introducing a hardware accelerator that operates entirely on spike inputs and outputs. It replaces residual addition with element-wise IAND and combines a fully parallel tick-batching dataflow with a reconfigurable, unrolled LIF neuron, which enables spatial-temporal parallelism and eliminates membrane memory. A vectorized dataflow supports 3x3 and 1x1 convolutions as well as matrix multiplications. Implemented in TSMC 28nm, the design achieves up to $3.456$ TSOPS (i.e., $3{,}456$ GSOPS) and $38.334$ TSOPS/W at $500$ MHz, using $198.46$K logic gates and $139.25$KB of SRAM. The authors present it as the first low-power hardware accelerator for Spiking Transformers, and it demonstrates a practical, energy-efficient path toward edge deployment of spike-based transformers.
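As a quick sanity check on the headline numbers, the following sketch derives two quantities that the text does not state directly: the implied power draw and the implied number of spike operations per clock cycle. It assumes that the peak throughput and the power efficiency were measured at the same operating point, which the text does not confirm; the function names are illustrative only.

```python
# Figures reported in the text above.
PEAK_TSOPS = 3.456               # tera spike operations per second
EFFICIENCY_TSOPS_PER_W = 38.334  # tera spike operations per second per watt
CLOCK_HZ = 500e6                 # 500 MHz

def implied_power_watts(tsops, tsops_per_watt):
    """Power implied by throughput / efficiency (assumes the same operating point)."""
    return tsops / tsops_per_watt

def implied_ops_per_cycle(tsops, clock_hz):
    """Spike operations that must complete per clock cycle to reach the peak rate."""
    return tsops * 1e12 / clock_hz

power_w = implied_power_watts(PEAK_TSOPS, EFFICIENCY_TSOPS_PER_W)
ops_per_cycle = implied_ops_per_cycle(PEAK_TSOPS, CLOCK_HZ)

print(f"implied power: {power_w * 1e3:.1f} mW")
print(f"implied spike ops per cycle: {ops_per_cycle:.0f}")
```

Under that assumption the figures correspond to roughly 90 mW and 6912 spike operations per cycle. These are derived values, not results reported by the authors.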

Abstract

This paper introduces the first low-power hardware accelerator for Spiking Transformers, an emerging alternative to traditional artificial neural networks. By modifying the base Spikformer model to use IAND instead of residual addition, the model exclusively utilizes spike computation. The hardware employs a fully parallel tick-batching dataflow and a time-step reconfigurable neuron architecture, addressing the delay and power challenges of multi-timestep processing in spiking neural networks. This approach processes outputs from all time steps in parallel, reducing computation delay and eliminating membrane memory, thereby lowering energy consumption. The accelerator supports 3x3 and 1x1 convolutions and matrix operations through vectorized processing, meeting model requirements. Implemented in TSMC's 28nm process, it achieves 3.456 TSOPS (tera spike operations per second) with a power efficiency of 38.334 TSOPS/W at 500MHz, using 198.46K logic gates and 139.25KB of SRAM.
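To illustrate why replacing residual addition with IAND keeps the model spike-only, here is a minimal sketch on binary spike vectors. It assumes the common definition IAND(a, b) = (NOT a) AND b; which operand the accelerator inverts (residual branch or shortcut) is an assumption here, and the function names are hypothetical.

```python
def iand(a, b):
    """Element-wise IAND on binary spikes: (NOT a) AND b.

    Assumption: `a` is the inverted operand. The output is always 0 or 1.
    """
    return [(1 - x) & y for x, y in zip(a, b)]

def residual_add(a, b):
    """Conventional residual addition, shown only for comparison."""
    return [x + y for x, y in zip(a, b)]

branch = [0, 0, 1, 1]    # spikes from the residual branch
shortcut = [0, 1, 0, 1]  # spikes from the shortcut path

print(residual_add(branch, shortcut))  # [0, 1, 1, 2]: the 2 is no longer a spike
print(iand(branch, shortcut))          # [0, 1, 0, 0]: still binary
```

Addition can produce values greater than 1 wherever both paths fire, so downstream layers would need multi-bit arithmetic. IAND keeps every activation at one bit, which is what allows spike-only computation throughout.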

Paper Structure

This paper contains 12 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The proposed system architecture
  • Figure 2: The proposed PE array
  • Figure 3: The proposed reconfigurable unrolled LIF neuron. The MUX selector inputs (from left to right) are set to 111/101/000 for time steps of 4/2/1, respectively.
  • Figure 4: Data flow of 3x3 convolution
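The MUX settings listed for Figure 3 can be read as follows: four LIF stages are chained, and each of the three MUXes decides whether a stage inherits the membrane potential of the stage before it. The sketch below is a behavioral model of that reading, not the authors' circuit. The leak factor, threshold, hard reset, and the interpretation of each select bit (1 = carry the potential forward, 0 = start from zero) are all assumptions.

```python
def unrolled_lif(currents, mux_select, threshold=1.0, leak=0.5):
    """Behavioral model of a 4-stage unrolled LIF chain (assumed semantics).

    currents:   four input currents, one per unrolled stage
    mux_select: three bits; bit i = 1 lets stage i+1 inherit the membrane
                potential of stage i, bit i = 0 starts stage i+1 from zero
    Returns the four output spikes, computed in one pass with no stored state.
    """
    spikes = []
    v_prev = 0.0
    for stage, current in enumerate(currents):
        carried = v_prev if stage > 0 and mux_select[stage - 1] else 0.0
        v = leak * carried + current          # leaky integration (assumed form)
        spike = 1 if v >= threshold else 0
        spikes.append(spike)
        v_prev = 0.0 if spike else v          # hard reset on firing (assumed)
    return spikes

inputs = [0.6, 0.6, 0.6, 0.6]
print(unrolled_lif(inputs, (1, 1, 1)))  # one neuron over 4 time steps
print(unrolled_lif(inputs, (1, 0, 1)))  # two neurons over 2 time steps each
print(unrolled_lif(inputs, (0, 0, 0)))  # four neurons over 1 time step each
```

With these inputs only the 111 setting accumulates enough potential to fire (at the third stage), because the chain is unbroken. Since all stages are evaluated in a single pass, the potential never needs to be written back between time steps, which is consistent with the claim that membrane memory is eliminated.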