SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks

Rui-Jie Zhu, Qihang Zhao, Guoqi Li, Jason K. Eshraghian

TL;DR

SpikeGPT addresses the energy demands of large language models by replacing self-attention with a spiking, recurrent RWKV module, enabling linear-time sequence processing and event-driven computation. It combines Leaky Integrate-and-Fire neurons, binary embeddings, and a Spiking Receptance Feed-Forward Network (SRFFN) to support language generation and understanding. The model demonstrates competitive NLG and NLU performance at 46M and 216M parameters while requiring about 20x fewer operations on neuromorphic hardware, and scaling experiments suggest continued gains with larger sizes. Overall, the work advances scalable, energy-efficient SNN-based NLP and points to practical neuromorphic implementations for future large-scale models.

Abstract

As large language models continue to scale, so do the computational resources required to run them. Spiking Neural Networks (SNNs) have emerged as an energy-efficient approach to deep learning that leverages sparse, event-driven activations to reduce the computational overhead of model inference. While they have become competitive with non-spiking models on many computer vision tasks, SNNs have also proven more challenging to train. As a result, their performance lags behind modern deep learning, and we have yet to see the effectiveness of SNNs in language generation. In this paper, inspired by the Receptance Weighted Key Value (RWKV) language model, we successfully implement 'SpikeGPT', a generative language model with binary, event-driven spiking activation units. We train two variants of the proposed model, with 45M and 216M parameters. To the best of our knowledge, SpikeGPT is the largest backpropagation-trained SNN model to date, rendering it suitable for both the generation and comprehension of natural language. We achieve this by modifying the transformer block, replacing multi-head self-attention so that computational complexity falls from quadratic, O(N^2), to linear, O(N), in the sequence length N. Input tokens are instead streamed in sequentially to our attention mechanism (as with typical SNNs). Our preliminary experiments show that SpikeGPT remains competitive with non-spiking models on tested benchmarks, while maintaining 20x fewer operations when processed on neuromorphic hardware that can leverage sparse, event-driven activations. Our code implementation is available at https://github.com/ridgerchu/SpikeGPT.
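
The step from O(N^2) to O(N) can be illustrated with a toy example. The sketch below is not the SpikeGPT or RWKV formulation; it is a simplified, single-channel decayed weighted average in the same spirit, and the function names `mix_recurrent` and `mix_quadratic` and the scalar `decay` parameter are assumptions made for illustration. It shows that a token mixer whose weights decay exponentially with distance can be computed by streaming tokens one at a time with a constant-size state, and that this yields the same outputs as revisiting every earlier token at each step.

```python
import math


def mix_recurrent(keys, values, decay):
    """O(N) streaming form: two running sums are updated once per token."""
    num, den = 0.0, 0.0
    out = []
    for k, v in zip(keys, values):
        weight = math.exp(k)
        # Older contributions shrink by exp(-decay) at every step.
        num = math.exp(-decay) * num + weight * v
        den = math.exp(-decay) * den + weight
        out.append(num / den)
    return out


def mix_quadratic(keys, values, decay):
    """O(N^2) reference: each position t revisits all tokens i <= t."""
    out = []
    for t in range(len(keys)):
        num = den = 0.0
        for i in range(t + 1):
            weight = math.exp(-decay * (t - i) + keys[i])
            num += weight * values[i]
            den += weight
        out.append(num / den)
    return out


if __name__ == "__main__":
    keys = [0.1, -0.3, 0.5, 0.0]
    values = [1.0, 2.0, -1.0, 0.5]
    print(mix_recurrent(keys, values, decay=0.7))
    print(mix_quadratic(keys, values, decay=0.7))
```

Both functions return the same sequence up to floating-point error, but the recurrent form keeps only `num` and `den` between tokens, which is what makes sequential, token-by-token processing possible.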

Paper Structure (30 sections, 21 equations, 12 figures, 5 tables)

Figures (12)

  • Figure 1: Model Architecture. The left portion displays the block-level structure. The middle and right illustrations demonstrate the Spiking RWKV and Spiking RFFN architectures, respectively. Spiking RWKV serves as a token mixer and Spiking RFFN functions as a channel mixer. These components are arranged in a loop with residual connections in a manner akin to a Transformer architecture.
  • Figure 2: Training SpikeGPT for NLG and NLU tasks.
  • Figure 3: Training Loss for Different Model Sizes on 0.9B Tokens.
  • Figure 4: Visualization of spike and membrane potential of neurons. Panels (a) and (b) depict the membrane potential of the Spiking RWKV layer, while panels (c) and (d) display the spike patterns observed in the SRFFN layer, where each dot represents a spike event.
  • Figure 5: The parallelized RWKV model computes the terms involving $e^{W}$ and $e^{K}V$ with a large-kernel convolution. As the convolution window slides, a decay is applied to manage temporal dependencies. In this demonstration, the sequence length is $N=4$ and the embedding size is $E=3$.
  • ...and 7 more figures
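
The membrane potentials and spike events visualized in Figure 4 can be related to a basic Leaky Integrate-and-Fire (LIF) update. The following is a minimal sketch of a single discrete-time LIF neuron, not the neuron model as implemented in SpikeGPT: the function name `lif_run`, the leak factor `beta`, the threshold value, and the subtract-threshold reset are all assumptions chosen for illustration.

```python
def lif_run(inputs, beta=0.5, threshold=1.0):
    """Simulate one LIF neuron over a sequence of input currents.

    Returns the membrane potential recorded at each step (before any
    reset) and the binary spike emitted at that step.
    """
    u = 0.0
    potentials, spikes = [], []
    for x in inputs:
        # Leak the previous potential, then integrate the new input.
        u = beta * u + x
        # Fire a binary spike when the threshold is reached.
        s = 1 if u >= threshold else 0
        potentials.append(u)
        spikes.append(s)
        if s:
            # Assumed reset convention: subtract the threshold after a spike.
            u -= threshold
    return potentials, spikes


if __name__ == "__main__":
    potentials, spikes = lif_run([0.5, 0.75, 0.0, 0.25])
    print(potentials)
    print(spikes)
```

The output of such a neuron is a sequence of 0s and 1s, so downstream layers only need to act on the steps where a spike occurs, which is the source of the sparse, event-driven computation described in the abstract.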