SpikingBrain: Spiking Brain-inspired Large Models

Yuqi Pan, Yupeng Feng, Jinghao Zhuang, Siyu Ding, Han Xu, Zehao Liu, Bohan Sun, Yuhong Chou, Xuerui Qiu, Anlin Deng, Anjie Hu, Shurong Wang, Peng Zhou, Man Yao, Jibin Wu, Jian Yang, Guoliang Sun, Bo Xu, Guoqi Li

TL;DR

This work tackles the inefficiency of Transformer-based LLMs in long-context settings and explores brain-inspired designs for stable, scalable training on non-NVIDIA hardware. By combining hybrid linear attention, sparse MoE, and adaptive-threshold spiking with a universal conversion pipeline, SpikingBrain achieves performance close to open-source Transformer baselines with far less training data (about 150B tokens of continual pre-training) and delivers large long-context speedups, including over 100x faster Time to First Token on 4M-token sequences. The authors validate two models on the MetaX cluster, demonstrating stable training across hundreds of GPUs, a 128k-token context, and substantial CPU and accelerator efficiency gains, including energy-efficient event-driven inference. The findings highlight the potential of brain-inspired mechanisms to redefine large-model scalability and deployment across diverse hardware platforms, including future neuromorphic chips.

Abstract

Mainstream Transformer-based large language models face major efficiency bottlenecks: training computation scales quadratically with sequence length, and inference memory grows linearly, limiting long-context processing. Building large models on non-NVIDIA platforms also poses challenges for stable and efficient training. To address this, we introduce SpikingBrain, a family of brain-inspired models designed for efficient long-context training and inference. SpikingBrain leverages the MetaX GPU cluster and focuses on three aspects: (1) Model Architecture: linear and hybrid-linear attention architectures with adaptive spiking neurons; (2) Algorithmic Optimizations: an efficient, conversion-based training pipeline and a dedicated spike coding framework; (3) System Engineering: customized training frameworks, operator libraries, and parallelism strategies tailored to MetaX hardware. Using these techniques, we develop two models: SpikingBrain-7B, a linear LLM, and SpikingBrain-76B, a hybrid-linear MoE LLM. These models demonstrate the feasibility of large-scale LLM development on non-NVIDIA platforms, and training remains stable for weeks on hundreds of MetaX GPUs with Model FLOPs Utilization at expected levels. SpikingBrain achieves performance comparable to open-source Transformer baselines while using only about 150B tokens for continual pre-training. Our models also significantly improve long-context efficiency and deliver inference with (partially) constant memory and event-driven spiking behavior. For example, SpikingBrain-7B attains over 100x speedup in Time to First Token for 4M-token sequences. Furthermore, the proposed spiking scheme achieves 69.15 percent sparsity, enabling low-power operation. Overall, this work demonstrates the potential of brain-inspired mechanisms to drive the next generation of efficient and scalable large model design.
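
The contrast between linearly growing inference memory and (partially) constant memory can be illustrated with a small sketch. It assumes plain, unnormalized causal linear attention, without the gating, decay, or feature maps a real model may use, so it is not the paper's operator. The function names are hypothetical. The recurrent form keeps only a fixed-size state, while the pairwise form revisits every earlier token.

```python
def linear_attention_recurrent(qs, ks, vs):
    """Causal linear attention using a running state of fixed size d_k x d_v.

    Assumption: unnormalized attention, o_t = q_t @ sum_{s<=t} k_s v_s^T.
    """
    d_k, d_v = len(ks[0]), len(vs[0])
    # The state size depends only on the head dimensions, not on sequence length.
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        # Accumulate the outer product k v^T into the state.
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        # Read out with the current query.
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs, state


def linear_attention_pairwise(qs, ks, vs):
    """Same outputs computed from explicit pairwise scores (quadratic in length)."""
    outputs = []
    for t, q in enumerate(qs):
        out = [0.0] * len(vs[0])
        for s in range(t + 1):  # causal: only positions up to t
            score = sum(a * b for a, b in zip(q, ks[s]))
            for j in range(len(out)):
                out[j] += score * vs[s][j]
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    qs = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
    ks = [[1.0, 0.0], [2.0, 1.0], [-1.0, 1.0]]
    vs = [[1.0, 1.0], [0.0, 2.0], [4.0, -2.0]]
    recurrent, state = linear_attention_recurrent(qs, ks, vs)
    print(recurrent == linear_attention_pairwise(qs, ks, vs))
    print(len(state), len(state[0]))
```

Both functions return the same outputs, but only the pairwise version needs all past keys and values at each step. Softmax attention has no such fixed-size state, which is why its cache grows with sequence length.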

Paper Structure

This paper contains 53 sections, 17 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Overview of SpikingBrain. Inspired by brain mechanisms, SpikingBrain integrates hybrid efficient attention, MoE modules, and spike encoding into its architecture, supported by a universal conversion pipeline compatible with the open-source model ecosystem. This enables continual pre-training with less than 2% of the data while achieving performance comparable to mainstream open-source models. We further adapt frameworks, operators, parallel strategies, and communication primitives for non-NVIDIA (MetaX) clusters, ensuring stable large-scale training and inference. SpikingBrain achieves over 100× speedup in TTFT for 4M-token sequences, while spiking delivers over 69% sparsity at the micro level. Combined with macro-level MoE sparsity, these advances provide valuable guidance for the design of next-generation neuromorphic chips.
  • Figure 2: Compatibility of SpikingBrain models across diverse computing platforms. SpikingBrain models can be deployed on CPUs and both NVIDIA and non-NVIDIA GPUs using integer activation formats, also inspiring the design of neuromorphic hardware leveraging event-driven sparse spike representations.
  • Figure 3: Integrated architectures of SpikingBrain models. FA: Full Softmax Attention; SWA: Sliding Window Attention; LA: Linear Attention. (Left) SpikingBrain-7B is a linear model with inter-layer hybridization. (Middle) Spike coding converts activations into integer counts for GPU execution or into spike trains for event-driven neuromorphic hardware. (Right) SpikingBrain-76B is a hybrid-linear MoE model with intra-layer hybridization, configured with 128 sink tokens, 16 routed experts, and 1 shared expert. Seven dense FFNs are located at layers $[1,2,3,5,7,9,11]$, with all other FFNs implemented as MoE layers. Attention modules are arranged as "LA + FA" at layers $[7,14,21,28]$, and "LA + SWA" at all other layers.
  • Figure 4: Schematic of three spike coding schemes. (a) An adaptive threshold maps membrane potential to spike counts, which are expanded over virtual timesteps into sparse spike trains, enabling the conversion from continuous activations to discrete spikes. (b) Ternary vs. Binary: binary uses $\{0,1\}$ to represent "spike/no-spike", while ternary uses $\{-1,0,1\}$ to encode both excitatory and inhibitory events. Compared with binary, ternary reduces both timesteps and firing rate by half. (c) Bitwise vs. Ternary: spike counts are unfolded into binary bits across timesteps to achieve temporal compression. In high-count scenarios, the required timesteps are far fewer than ternary, leading to significantly higher efficiency.
  • Figure 5: Operator adaptation of SpikingBrain on MetaX GPUs. The adaptation involves two complementary pathways, Triton adaptation and CUDA migration to the MACA framework, which cover different operator subsets. Together, they form a unified hardware adaptation framework tailored for MetaX GPUs.
  • ...and 5 more figures
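
The spike coding schemes described in the Figure 4 caption can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the rounding rule in `spike_count`, the fixed `threshold` argument (the paper's threshold is adaptive), the sign handling in `expand_bitwise`, and all function names are hypothetical.

```python
def spike_count(potential, threshold):
    """Quantize a membrane potential to a signed integer spike count.

    Assumption: simple rounding of potential / threshold.
    """
    return round(potential / threshold)


def expand_ternary(count, timesteps):
    """Expand a count into a train over {-1, 0, 1}: one spike per unit of |count|."""
    if abs(count) > timesteps:
        raise ValueError("count does not fit in the given number of timesteps")
    sign = (count > 0) - (count < 0)
    return [sign] * abs(count) + [0] * (timesteps - abs(count))


def expand_bitwise(count, bits):
    """Unfold |count| into binary digits, least significant bit first.

    Assumption: the sign of the count is carried on every emitted spike.
    """
    magnitude = abs(count)
    if magnitude >= (1 << bits):
        raise ValueError("count does not fit in the given number of bits")
    sign = (count > 0) - (count < 0)
    return [sign * ((magnitude >> b) & 1) for b in range(bits)]


def decode_ternary(train):
    """Recover the count by summing the spikes."""
    return sum(train)


def decode_bitwise(train):
    """Recover the count by weighting timestep b with 2**b."""
    return sum(spike * (1 << b) for b, spike in enumerate(train))


def firing_events(train):
    """Number of timesteps on which a spike (of either sign) is emitted."""
    return sum(1 for spike in train if spike != 0)


if __name__ == "__main__":
    count = spike_count(2.6, 0.2)
    ternary = expand_ternary(count, timesteps=count)
    bitwise = expand_bitwise(count, bits=4)
    print(count, len(ternary), firing_events(ternary))
    print(bitwise, len(bitwise), firing_events(bitwise))
```

For a high count such as 13, the ternary train in this sketch needs 13 timesteps and 13 spike events, while the bitwise train needs 4 timesteps and 3 events. This is the temporal compression the caption attributes to bitwise coding.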