SpikeLLM: Scaling up Spiking Neural Network to Large Language Models via Saliency-based Spiking

Xingrun Xing, Boyan Gao, Zheng Zhang, David A. Clifton, Shitao Xiao, Li Du, Guoqi Li, Jiajun Zhang

TL;DR

This work tackles the high energy cost of inference in large language models by introducing SpikeLLM, the first spiking large language model scaled to 7–70B parameters. It combines Generalized Integrate-and-Fire (GIF) neurons, which compress spike length from $T$ to $\frac{T}{L}\log_2 L$ bits, with Optimal Brain Spiking (OBSpiking), which allocates multistep spikes based on per-channel saliency to approach a $\log_2 T$-bit encoding. In experiments, SpikeLLM improves perplexity and zero-shot reasoning over conventional quantization pipelines (e.g., an $11.01\%$ perplexity reduction on WikiText2 and a $2.55\%$ accuracy gain for LLAMA-7B in W4A4), and supports additive LLMs with ternary GIF neurons that enable fully additive linear layers. The approach demonstrates a viable path toward energy-efficient, spike-driven LLMs that can outperform traditional quantization in both accuracy and hardware efficiency, though a performance gap to full-precision ANN-LLMs remains and further pretraining and hardware specialization could close it.

Abstract

Recent advancements in large language models (LLMs) with billions of parameters have improved performance in various applications, but their inference processes demand significant energy and computational resources. In contrast, the human brain, with approximately 86 billion neurons, is much more energy-efficient than LLMs with a similar number of parameters. Inspired by this, we redesign 7$\sim$70 billion parameter LLMs using bio-plausible spiking mechanisms, emulating the efficient behavior of the human brain. We propose the first spiking large language model, SpikeLLM. Coupled with the proposed model, two essential approaches are proposed to improve spike training efficiency: Generalized Integrate-and-Fire (GIF) neurons to compress spike length from $T$ to $\frac{T}{L} \log_2 L$ bits, and an Optimal Brain Spiking framework to divide outlier channels and allocate different $T$ for GIF neurons, which further compresses spike length to approximately $\log_2 T$ bits. The necessity of spike-driven LLMs is demonstrated by comparison with quantized LLMs with similar operations. In the OmniQuant pipeline, SpikeLLM reduces WikiText2 perplexity by 11.01% and improves accuracy of common scene reasoning by 2.55% on a LLAMA-7B W4A4 model. In the GPTQ pipeline, SpikeLLM achieves direct addition in linear layers, significantly exceeding PB-LLMs.
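
The spike-length figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration of the two formulas, $\frac{T}{L}\log_2 L$ and $\log_2 T$. It assumes that $T$ counts binary spike steps (one bit each) and that $L$ is the number of steps merged into a single multi-level step; that reading, and the function names `binary_spike_bits` and `gif_spike_bits`, are assumptions made here rather than definitions taken from the paper.

```python
import math


def binary_spike_bits(T: int) -> int:
    """Bits used by a plain binary spike train: one bit per step (assumed baseline)."""
    return T


def gif_spike_bits(T: int, L: int) -> float:
    """Evaluate (T / L) * log2(L), the compressed spike length quoted in the abstract.

    Assumes L >= 2 and that L divides T, so the T steps split into T / L groups.
    """
    if L < 2 or T % L != 0:
        raise ValueError("this sketch expects L >= 2 and T divisible by L")
    return (T // L) * math.log2(L)


if __name__ == "__main__":
    T = 16
    for L in (2, 4, 16):
        print(f"T={T}, L={L}: {binary_spike_bits(T)} bits -> {gif_spike_bits(T, L)} bits")
    # When L equals T the expression reduces to log2(T), the limit the abstract mentions.
    print(gif_spike_bits(T, T) == math.log2(T))
```

With $T = 16$, choosing $L = 4$ gives $4 \times 2 = 8$ bits, and choosing $L = T = 16$ gives $\log_2 16 = 4$ bits, which is the $\log_2 T$ bound that the Optimal Brain Spiking allocation is described as approaching.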

Paper Structure (30 sections, 12 equations, 9 figures, 12 tables)

This paper contains 30 sections, 12 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: Different encoding methods. In (a, b), the activation has N channels; each value has T quantization levels. Given salient channels (in blue), mixed-precision methods (c) are deployment-unfriendly. In spike-driven methods (d), we expand salient channels by spiking dynamics to realize single-precision quantization, where T' is the number of spiking steps in salient channels.
  • Figure 2: Saliency-aware spiking mechanisms in SpikeLLM. (Left) Spiking self-attention. Salient channels in the KV caches are encoded by multi-step spikes. (Right) Spiking activations or weights in a linear layer, where saliency is detected by a gradient or Hessian metric, respectively.
  • Figure 3: Comparisons of different saliency metrics in the first linear layer. (a) Insignificant per-token gradient saliency in activations. (b) Significant per-channel gradient saliency in activations. (c) Significant per-channel Hessian saliency in weights. The horizontal axis represents each channel.
  • Figure 4: Ablation studies in LLAMA-2-7B. Average accuracy (not norm) is reported. (a) Comparison between SpikeLLM and Quantized-ANN with the same operations. (b) Ablations on spiking salient channels in activations and KV-Caches. (c) Ablations on spiking salient channels in weights.
  • Figure 5: Efficiency comparisons of $\text{SpikeLLM}_\texttt{Ter}$ and PB-LLM on WikiText-2, C4, and 6 zero-shot benchmarks. We use the average of equal steps as the operation metric of SNNs and BNNs.
  • ...and 4 more figures

Theorems & Definitions (1)

  • proof