A Diagonal Structured State Space Model on Loihi 2 for Efficient Streaming Sequence Processing

Svea Marie Meyer, Philipp Weidel, Philipp Plank, Leobardo Campos-Macias, Sumit Bam Shrestha, Philipp Stratmann, Mathis Richter

TL;DR

The paper presents the first neuromorphic-hardware implementation of a deep state-space model (S4D) on Intel's Loihi 2 and benchmarks it against a Jetson Orin Nano on streaming sequence tasks. By combining the diagonal state-space representation with neuromorphic hardware, the Loihi 2 implementation consumes about 1000 times less energy, with 75 times lower latency, than the recurrent S4D implementation on Jetson during token-by-token inference, which makes real-time streaming processing practical. Post-training quantization (PTQ) and quantization-aware fine-tuning (QAFT) are used to maintain accuracy on the fixed-precision Loihi 2 hardware, giving competitive results on the MNIST-family datasets and some degradation on CIFAR that QAFT can mitigate. The findings show a clear division of labor: Loihi 2 excels in streaming scenarios, while GPUs remain preferable for offline batched processing. This opens a path to energy-efficient real-time sequence processing in latency- and energy-constrained deployments.

Abstract

Deep State-Space Models (SSMs) demonstrate state-of-the-art performance on long-range sequence modeling tasks. While the recurrent structure of SSMs can be efficiently implemented as a convolution or as a parallel scan during training, recurrent token-by-token processing cannot currently be implemented efficiently on GPUs. Here, we demonstrate efficient token-by-token inference of the SSM S4D on Intel's Loihi 2 state-of-the-art neuromorphic processor. We compare this first-ever neuromorphic-hardware implementation of an SSM on sMNIST, psMNIST, and sCIFAR to a recurrent and a convolutional implementation of S4D on Jetson Orin Nano (Jetson). While we find Jetson to perform better in an offline, sample-by-sample batched processing mode, Loihi 2 outperforms it in token-by-token processing, where it consumes 1000 times less energy with a 75 times lower latency and a 75 times higher throughput compared to the recurrent implementation of S4D on Jetson. This opens up new avenues towards efficient real-time streaming applications of SSMs.
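
To make the distinction between the two processing modes concrete, the following is a minimal sketch of a single diagonal SSM channel evaluated both ways: token by token through the recurrence, and offline as a causal convolution over the whole sequence. It is not the authors' Loihi 2 or Jetson implementation. The function names, the state size `N`, the sequence length `L`, the step size `dt`, the zero-order-hold discretization, and the initialization of `A`, `B`, and `C` are all illustrative assumptions, and taking the real part of the output is a common simplification for diagonal SSMs with complex states.

```python
import cmath
import random


def discretize_zoh(A, B, dt):
    """Zero-order-hold discretization of a diagonal SSM (elementwise)."""
    A_bar = [cmath.exp(dt * a) for a in A]
    B_bar = [(ab - 1.0) / a * b for ab, a, b in zip(A_bar, A, B)]
    return A_bar, B_bar


def ssm_recurrent(A_bar, B_bar, C, u):
    """Streaming mode: one state update and one output per input token."""
    x = [0j] * len(A_bar)  # hidden state, one complex value per diagonal entry
    y = []
    for u_k in u:
        # Diagonal A means the update is elementwise, with no matrix product.
        x = [ab * x_n + bb * u_k for ab, bb, x_n in zip(A_bar, B_bar, x)]
        y.append(sum(c * x_n for c, x_n in zip(C, x)).real)
    return y


def ssm_convolutional(A_bar, B_bar, C, u):
    """Offline mode: build the kernel K_l = Re(sum_n C_n A_bar_n^l B_bar_n)
    and apply it as a causal convolution over the full sequence."""
    L = len(u)
    kernel = [
        sum(c * ab**l * bb for c, ab, bb in zip(C, A_bar, B_bar)).real
        for l in range(L)
    ]
    return [sum(kernel[k - j] * u[j] for j in range(k + 1)) for k in range(L)]


# Illustrative parameters (assumptions, not taken from the paper).
random.seed(0)
N, L = 8, 64
A = [complex(-0.5, cmath.pi * n) for n in range(N)]  # stable: Re(A) < 0
B = [1 + 0j] * N
C = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(N)]
u = [random.gauss(0, 1) for _ in range(L)]

A_bar, B_bar = discretize_zoh(A, B, dt=0.01)
y_rec = ssm_recurrent(A_bar, B_bar, C, u)
y_conv = ssm_convolutional(A_bar, B_bar, C, u)

# Both modes compute the same outputs up to floating-point rounding.
max_abs_diff = max(abs(a - b) for a, b in zip(y_rec, y_conv))
```

The recurrent form needs only the current state and the current token, which is why it suits streaming input. The convolutional form needs the whole sequence up front, which is why it suits offline batched processing.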

Paper Structure

This paper contains 15 sections, 1 equation, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: Loihi 2 systems of different form factors
  • Figure 2: S4D model architecture as implemented on Loihi 2. Light blue layers refer to connections and dark blue layers to programmable neurons on Loihi 2. Variables above each layer denote the dimensionality of the layer.