SpikeLLM: Scaling up Spiking Neural Network to Large Language Models via Saliency-based Spiking

Xingrun Xing; Boyan Gao; Zheng Zhang; David A. Clifton; Shitao Xiao; Li Du; Guoqi Li; Jiajun Zhang

TL;DR

This work tackles the high energy cost of inference in large language models by introducing SpikeLLM, the first spiking large language model scaled to 7–70B parameters. It combines Generalized Integrate-and-Fire (GIF) neurons, which compress spike length from $T$ to $\frac{T}{L}\log_2 L$ bits, with Optimal Brain Spiking (OBSpiking), which allocates multistep spikes based on per-channel saliency to approach a $\log_2 T$-bit encoding. In experiments, SpikeLLM improves perplexity and zero-shot reasoning over conventional quantization pipelines (e.g., an $11.01\%$ perplexity reduction on WikiText2 and a $2.55\%$ accuracy gain for LLAMA-7B in W4A4), and supports additive LLMs with ternary GIF neurons that enable fully additive linear layers. The approach demonstrates a viable path toward energy-efficient, spike-driven LLMs that can outperform traditional quantization in both accuracy and hardware efficiency. A performance gap to full-precision ANN-LLMs remains, which further pretraining and hardware specialization could close.
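As a rough illustration of the compression claim above, the following sketch evaluates $\frac{T}{L}\log_2 L$ for a few settings. The function name `gif_spike_length` and the requirement that $L$ divides $T$ are assumptions made for this example, not details taken from the paper. It reproduces only the arithmetic of the formula, not the GIF neuron itself.

```python
import math


def gif_spike_length(T: int, L: int) -> float:
    """Evaluate (T / L) * log2(L), the compressed spike length in bits.

    Assumption for this sketch: T binary spike steps are merged in groups
    of L, so L must divide T and be at least 2.
    """
    if L < 2 or T % L != 0:
        raise ValueError("expected L >= 2 and L dividing T")
    return (T // L) * math.log2(L)


if __name__ == "__main__":
    T = 16
    for L in (4, 8, 16):
        print(f"T={T}, L={L}: {gif_spike_length(T, L)} bits")
    # T=16, L=4: 8.0 bits
    # T=16, L=8: 6.0 bits
    # T=16, L=16: 4.0 bits
```

With $L = T$ the expression reduces to $\log_2 T$, which is the limit the text says OBSpiking approaches.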

Abstract

Recent advancements in large language models (LLMs) with billions of parameters have improved performance in various applications, but their inference processes demand significant energy and computational resources. In contrast, the human brain, with approximately 86 billion neurons, is much more energy-efficient than LLMs with similar parameters. Inspired by this, we redesign 7$\sim$70 billion parameter LLMs using bio-plausible spiking mechanisms, emulating the efficient behavior of the human brain. We propose the first spiking large language model, SpikeLLM. Coupled with the proposed model, two essential approaches are proposed to improve spike training efficiency: Generalized Integrate-and-Fire (GIF) neurons to compress spike length from $T$ to $\frac{T}{L} \log_2 L$ bits, and an Optimal Brain Spiking framework to divide outlier channels and allocate different $T$ for GIF neurons, which further compresses spike length to approximately $\log_2 T$ bits. The necessity of spike-driven LLMs is demonstrated by comparison with quantized LLMs with similar operations. In the OmniQuant pipeline, SpikeLLM reduces WikiText2 perplexity by 11.01% and improves common scene reasoning accuracy by 2.55% on a LLAMA-7B W4A4 model. In the GPTQ pipeline, SpikeLLM achieves directly additive linear layers, significantly exceeding PB-LLMs.
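The idea of allocating different $T$ to outlier channels can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Optimal Brain Spiking procedure: the function `allocate_spiking_steps`, the top-fraction selection rule, and the default values are all hypothetical, and the saliency scores are taken as given rather than computed from gradients or Hessians.

```python
from typing import List, Sequence


def allocate_spiking_steps(
    saliency: Sequence[float],
    salient_fraction: float = 0.1,  # hypothetical share of channels treated as salient
    salient_steps: int = 4,         # hypothetical T' for salient channels
    base_steps: int = 1,            # hypothetical step count for the remaining channels
) -> List[int]:
    """Give the most salient channels more spiking steps than the rest."""
    n = len(saliency)
    if n == 0:
        return []
    # Keep at least one salient channel so the allocation is never trivial.
    k = max(1, round(n * salient_fraction))
    # Indices of the k channels with the largest saliency scores.
    top = sorted(range(n), key=lambda i: saliency[i], reverse=True)[:k]
    steps = [base_steps] * n
    for i in top:
        steps[i] = salient_steps
    return steps


if __name__ == "__main__":
    scores = [0.1, 5.0, 0.2, 0.3]  # made-up per-channel saliency scores
    print(allocate_spiking_steps(scores, salient_fraction=0.25))
    # [1, 4, 1, 1]
```

Only the channel with the largest score receives the longer spike train here; every other channel keeps the base step count, so most of the tensor stays at a single precision.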


Paper Structure (30 sections, 12 equations, 9 figures, 12 tables)

Introduction
Related Works
Problem Formulation
Spiking Neuronal Dynamics
Limitations of Traditional Quantization
Spike-Driven Quantization
Generalized Integrate-and-Fire Neuron
Saliency-Aware Spiking Steps
Optimal Brain Spiking
Experiments
Main Results
Additive Spiking LLMs
Conclusion
Appendix
Low-Bit Quantization
...and 15 more sections

Figures (9)

Figure 1: Different encoding methods. In (a, b), the activation has N channels; each value has T quantization levels. Given salient channels (in blue), mix-precision methods (c) are deployment unfriendly. In spike-driven methods (d), we expand salient channels by spiking dynamics to realize single precision quantization, where T' is spiking steps in salient channels.
Figure 2: Saliency-aware spiking mechanisms in SpikeLLM. (Left) Spiking self-attention. Salient channels in the KV caches are encoded by multi-step spikes. (Right) Spiking activations or weights in a linear layer, where saliency is detected by a gradient or Hessian metric, respectively.
Figure 3: Comparisons of different saliency metrics in the first linear layer. (a) Insignificant per-token gradient saliency in activations. (b) Significant per-channel gradient saliency in activations. (c) Significant per-channel Hessian saliency in weights. The horizontal axis represents each channel.
Figure 4: Ablation studies in LLAMA-2-7B. Average accuracy (not norm) is reported. (a) Comparison between SpikeLLM and Quantized-ANN with the same operations. (b) Ablations on spiking salient channels in activations and KV-Caches. (c) Ablations on spiking salient channels in weights.
Figure 5: Efficiency comparisons of $\text{SpikeLLM}_\texttt{Ter}$ and PB-LLM on WikiText2, C4, and 6 zero-shot benchmarks. We use the average of equal steps as the operation metric of SNNs and BNNs.
...and 4 more figures

Theorems & Definitions (1)

proof
