Sorbet: A Neuromorphic Hardware-Compatible Transformer-Based Spiking Language Model

Kaiwen Tang; Zhanglu Yan; Weng-Fai Wong

TL;DR

The paper tackles the challenge of deploying transformer-based language models on neuromorphic hardware for energy-efficient edge inference. It introduces Sorbet, a spiking, transformer-based architecture that replaces softmax and layer normalization with two neuromorphic-friendly operators, PTsoftmax and BSPN, and employs knowledge distillation and quantization to produce a highly compressed 1-bit-weight model. Sorbet achieves $27.16\times$ energy savings compared to BERT while maintaining competitive accuracy on GLUE, demonstrating that end-to-end neuromorphic NLP is feasible. The work validates the approach with software simulations and hardware-oriented analyses, and releases code to encourage hardware-aware development for edge NLP.

Abstract

For reasons such as privacy, there are use cases for language models at the edge. This has given rise to small language models targeted for deployment in resource-constrained devices where energy efficiency is critical. Spiking neural networks (SNNs) offer a promising solution due to their energy efficiency, and there are already works on realizing transformer-based models on SNNs. However, key operations like softmax and layer normalization (LN) are difficult to implement on neuromorphic hardware, and many of these early works sidestepped them. To address these challenges, we introduce Sorbet, a transformer-based spiking language model that is more neuromorphic hardware-compatible. Sorbet incorporates a novel shifting-based softmax called PTsoftmax and a Bit Shifting PowerNorm (BSPN), both designed to replace the respective energy-intensive operations. By leveraging knowledge distillation and model quantization, Sorbet achieved a highly compressed binary weight model that maintains competitive performance while achieving $27.16\times$ energy savings compared to BERT. We validate Sorbet through extensive testing on the GLUE benchmark and a series of ablation studies, demonstrating its potential as an energy-efficient solution for language model inference. Our code is publicly available at https://github.com/Kaiwen-Tang/Sorbet.
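
To make the idea of a "shifting-based softmax" concrete, here is a minimal sketch of how a softmax-like normalization can be computed with shifts and additions only. It is an illustration under stated assumptions, not the paper's PTsoftmax definition, which this summary does not reproduce: the function name `pow2_softmax_sketch`, the `frac_bits` parameter, the use of integer logits, the base-2 exponent, and rounding the denominator up to a power of two are all assumptions made for this example.

```python
def pow2_softmax_sketch(scores, frac_bits=8):
    """Softmax-like normalization using only shifts and adds (illustrative).

    scores: list of Python ints (assumed integer logits).
    Returns fixed-point values scaled by 2**frac_bits.
    """
    m = max(scores)
    one = 1 << frac_bits
    # 2**(s - m) in fixed point: a right shift by (m - s) replaces the exponential.
    numer = [one >> (m - s) for s in scores]
    total = sum(numer)
    # Assumption: round the denominator up to the next power of two,
    # so the division becomes a single right shift.
    shift = (total - 1).bit_length()
    return [(n << frac_bits) >> shift for n in numer]


probs = pow2_softmax_sketch([2, 1, 0])
print([p / 256 for p in probs])  # [0.5, 0.25, 0.125]
```

Because the denominator is rounded up, the outputs sum to at most 1 rather than exactly 1; that is the accuracy cost this kind of approximation trades for removing the division.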

Paper Structure (33 sections, 4 theorems, 38 equations, 4 figures, 7 tables, 4 algorithms)

Introduction
Related Work
Transformer-based SNNs
Quantized BERT
Simplified Architecture
Preliminary
Spiking Neural Networks
Spike Neuron Model
Challenges of Adapting Transformers to SNNs
Methods
Bit Shifting PowerNorm
Power-of-Two Softmax
Sorbet Architecture
Training Process
Result
...and 18 more sections

Key Result

Theorem 4.2

The loss $L_{\mathrm{PN}}$ under PowerNorm is bounded by a constant, denoted as $C$. Denote the BSPN loss by $\mathcal{L}_{\mathrm{BSPN}}$. Then $\mathcal{L}_{\mathrm{BSPN}}$ also has a bounded gradient w.r.t. $\mathbf{X}_{:,i}$, specifically

Figures (4)

Figure 1: Comparison of the architecture of BERT and Sorbet. (a) The multi-head self-attention block of BERT; (b) The feed-forward network of BERT; (c) The spiking multi-head self-attention of Sorbet; (d) The spiking feed-forward network of Sorbet. All the outputs of $\mathcal{SN}$ are spike trains to ensure Sorbet is multiplication-free. The red-bordered box highlights our proposed operations.
Figure 2: Energy cost of different operations. Each value represents a single execution with an input dimension of 128. Based on 45nm technology, a FIX8 division requires 0.59 pJ, whereas a bit-shift operation requires only 0.024 pJ.
Figure 3: Spike firing rate for the output of each block on SST-2 and STS-B datasets.
Figure 4: Distribution of $\mathbf{S}(\mathbf{X})$ measured from various Sorbet layers. The strictly positive values support the positivity assumption made in the analysis.
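
As a quick check on the per-operation figures quoted in the Figure 2 caption, the ratio between a FIX8 division and a bit shift can be worked out directly. The constant names below are chosen for this example only; the two energy values are the ones stated in the caption.

```python
# Per-operation energy at 45nm, as quoted in the Figure 2 caption (picojoules).
DIV_FIX8_PJ = 0.59
BIT_SHIFT_PJ = 0.024

# How many bit shifts cost the same energy as one FIX8 division.
ratio = DIV_FIX8_PJ / BIT_SHIFT_PJ
print(f"One FIX8 division costs about {ratio:.1f}x a bit shift")  # about 24.6x
```

This is a single-operation comparison and should not be confused with the $27.16\times$ whole-model energy saving relative to BERT reported in the abstract.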

Theorems & Definitions (8)

Definition 4.1
Theorem 4.2: BSPN Preserves Bounded Gradient
Lemma 4.3: 1-Lipschitz Property of $\Phi(X)$
Lemma 4.4: Effect of BSPN on the Lipschitz Constant of the Loss
Lemma 4.5
Proof of Theorem 4.2 (thm:bspn-preserve-grad)
Proof of lemma lem:bit-shift-lipschitz
Proof of lemma lem:ptsfm
