NeuronSpark: A Spiking Neural Network Language Model with Selective State Space Dynamics

Zhengzheng Tang

Abstract

We ask whether a pure spiking backbone can learn large-scale language modeling from random initialization, without Transformer distillation. We introduce NeuronSpark, a 0.9B-parameter SNN language model trained with next-token prediction and surrogate gradients. The model combines selective state-space spiking dynamics, leakage-current inter-layer communication, PonderNet adaptive timesteps, fused Triton PLIF kernels, and stabilization techniques (residual centering, lateral-inhibition normalization, and natural-gradient compensation). Under a constrained budget (about 1.4B pretraining tokens and 6.5K SFT steps), NeuronSpark-0.9B reaches a pretraining loss of 3.6 and shows early multi-turn dialogue behavior after SFT. These results support the feasibility of end-to-end language modeling with a pure SNN architecture at this scale.

Paper Structure (52 sections, 8 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: NeuronSpark architecture overview. The residual stream carries continuous values $\mathbf{h}$; PLIFLeak denotes PLIF neurons with leakage activation $(1-\beta)\cdot V_{\text{post}}$. PonderNet aggregation (applied per sublayer) collapses $K$ frames per token with learned geometric-distribution weights. Inter-layer communication uses floating-point leakage-current signals; binary spikes are internal firing events rather than the default layer-to-layer representation. The decode stage uses uniform $K$-frame mean. Residual centering (subtract per-token mean) is applied before each residual addition.
  • Figure 2: Pretraining loss curve over 85K steps ($\sim$1.4B tokens). Loss decreases from 9.0 to $\sim$3.5. Training throughput: $\sim$960 tokens/sec on 8$\times$ RTX 4090 GPUs.
  • Figure 3: Training loss: final architecture (blue) vs. 9 ablation variants. None of the ablation variants achieves a loss below 7.0.
  • Figure 4: Biological interpretability of NeuronSpark-0.9B-Chat. (a) Per-token E[K]: punctuation receives fewer steps than content words. (b) POS-level E[K]: function words and punctuation are lower by about 0.7. (c) Per-layer E[K]: E[K] for SNNBlock increases with depth, while E[K] for SNNFFN stays near 7--8. (d) Learned $\beta$ distribution: 67.3% fast ($<0.9$), 32.7% slow ($\geq 0.9$).
  • Figure 5: Surprisal vs. E[K] (40 Chinese sentences, 541 tokens). (a) The apparent correlation is dominated by BOS tokens: $r=-0.50$ overall, $r=-0.12$ without BOS. (b) Binned E[K] is nearly flat across surprisal, indicating allocation is largely independent of predictive difficulty.
  • ...and 1 more figure
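
The operations named in the Figure 1 caption (leakage activation $(1-\beta)\cdot V_{\text{post}}$, PonderNet aggregation of $K$ frames with geometric-distribution weights, the uniform $K$-frame mean used at decode, and residual centering) can be illustrated with a small pure-Python sketch. This is not the authors' implementation. The function names, the per-frame halting probabilities `halt_probs`, the choice to fold leftover probability mass into the last frame, and the choice to center the sublayer output over the feature dimension are all assumptions made for illustration; this listing does not give the paper's exact formulation.

```python
from typing import List


def geometric_weights(halt_probs: List[float]) -> List[float]:
    """PonderNet-style weights p_k = lam_k * prod_{j<k} (1 - lam_j).

    Assumption: the probability mass left after the last frame is
    added to the last frame so the weights sum to 1.
    """
    weights: List[float] = []
    survive = 1.0  # probability of not having halted yet
    for lam in halt_probs:
        weights.append(lam * survive)
        survive *= 1.0 - lam
    weights[-1] += survive
    return weights


def uniform_weights(k: int) -> List[float]:
    """Uniform K-frame mean, as the caption describes for the decode stage."""
    return [1.0 / k] * k


def expected_steps(weights: List[float]) -> float:
    """E[K]: expected frame index (1-based) under the halting weights."""
    return sum((k + 1) * w for k, w in enumerate(weights))


def aggregate_frames(frames: List[List[float]], weights: List[float]) -> List[float]:
    """Collapse K frames (each a feature vector) into one weighted vector."""
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]


def leakage_activation(v_post: List[float], beta: float) -> List[float]:
    """Leakage-current signal (1 - beta) * V_post from the caption."""
    return [(1.0 - beta) * v for v in v_post]


def center(x: List[float]) -> List[float]:
    """Subtract the per-token mean over the feature dimension."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def residual_add(h: List[float], update: List[float]) -> List[float]:
    """Assumption: the sublayer output is centered, then added to the stream."""
    return [a + b for a, b in zip(h, center(update))]
```

For example, with a halting probability of 0.5 at each of $K=3$ frames, `geometric_weights([0.5, 0.5, 0.5])` yields weights 0.5, 0.25, 0.25 and `expected_steps` gives E[K] = 1.75. Lower halting probabilities shift weight toward later frames and raise E[K], the quantity plotted in Figures 4 and 5.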