A Diagonal Structured State Space Model on Loihi 2 for Efficient Streaming Sequence Processing
Svea Marie Meyer, Philipp Weidel, Philipp Plank, Leobardo Campos-Macias, Sumit Bam Shrestha, Philipp Stratmann, Mathis Richter
TL;DR
The paper demonstrates the first neuromorphic-hardware implementation of a deep state-space model (S4D) on Intel's Loihi 2 and benchmarks it against a Jetson Orin Nano on streaming sequence tasks. By exploiting the diagonal state-space representation on neuromorphic hardware, the Loihi 2 implementation consumes about 1000 times less energy and runs with about 75 times lower latency than the recurrent S4D implementation on Jetson in token-by-token inference, enabling real-time streaming processing. Quantization techniques (PTQ and QAFT) are employed to maintain accuracy on fixed-precision Loihi 2 hardware, with competitive results on MNIST-family datasets and some degradation on CIFAR that QAFT can mitigate. The findings show a clear division of labor: Loihi 2 excels in streaming scenarios, while GPUs remain preferable for offline batched processing. This opens pathways to real-time sequence processing in latency- and energy-constrained deployments.
Abstract
Deep state-space models (SSMs) demonstrate state-of-the-art performance on long-range sequence modeling tasks. While the recurrent structure of SSMs can be efficiently implemented as a convolution or as a parallel scan during training, recurrent token-by-token processing cannot currently be implemented efficiently on GPUs. Here, we demonstrate efficient token-by-token inference of the SSM S4D on Loihi 2, Intel's state-of-the-art neuromorphic processor. We compare this first-ever neuromorphic-hardware implementation of an SSM to a recurrent and a convolutional implementation of S4D on Jetson Orin Nano (Jetson), using sMNIST, psMNIST, and sCIFAR. While Jetson performs better in an offline, batched sample-by-sample processing mode, Loihi 2 outperforms it in token-by-token processing, where it consumes 1000 times less energy with 75 times lower latency and 75 times higher throughput than the recurrent implementation of S4D on Jetson. This opens up new avenues towards efficient real-time streaming applications of SSMs.
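
To make the distinction between the recurrent and the convolutional view concrete, the following is a minimal NumPy sketch of a single diagonal SSM channel evaluated both ways. It is an illustration only, not the Loihi 2 or Jetson implementation from the paper: the zero-order-hold discretization, the initialization A_n = -1/2 + i*pi*n, the state size, the step size, and all function names are assumptions chosen for the example.

```python
import numpy as np

def discretize_zoh(a, b, dt):
    """Zero-order-hold discretization of a diagonal SSM (assumed here)."""
    a_bar = np.exp(dt * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

def s4d_recurrent(a_bar, b_bar, c, u):
    """Token-by-token mode: one elementwise state update per input token."""
    x = np.zeros_like(a_bar)              # complex state, one entry per mode
    y = np.empty(len(u))
    for k, u_k in enumerate(u):
        x = a_bar * x + b_bar * u_k       # diagonal A: no matrix product needed
        y[k] = (c * x).sum().real         # real-valued readout
    return y

def s4d_convolutional(a_bar, b_bar, c, u):
    """Offline mode: build the kernel K_l = Re(sum_n c_n a_bar_n^l b_bar_n)."""
    seq_len = len(u)
    powers = a_bar[None, :] ** np.arange(seq_len)[:, None]   # shape (L, N)
    kernel = (powers * (b_bar * c)[None, :]).sum(axis=1).real
    return np.convolve(u, kernel)[:seq_len]                  # causal part only

def demo(n_state=16, seq_len=64, dt=0.05, seed=0):
    """Run both modes on the same random input (hypothetical parameters)."""
    rng = np.random.default_rng(seed)
    a = -0.5 + 1j * np.pi * np.arange(n_state)   # assumed diagonal initialization
    b = np.ones(n_state, dtype=complex)
    c = rng.standard_normal(n_state) + 1j * rng.standard_normal(n_state)
    u = rng.standard_normal(seq_len)
    a_bar, b_bar = discretize_zoh(a, b, dt)
    return s4d_recurrent(a_bar, b_bar, c, u), s4d_convolutional(a_bar, b_bar, c, u)

y_rec, y_conv = demo()
print(np.allclose(y_rec, y_conv))
```

Both functions compute the same output sequence. The convolutional form needs the whole input before it can produce anything, whereas the recurrent form emits one output per incoming token and keeps only the current state, which is the property that matters for streaming.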
