Neuromorphic Principles for Efficient Large Language Models on Intel Loihi 2

Steven Abreu, Sumit Bam Shrestha, Rui-Jie Zhu, Jason Eshraghian

TL;DR

The paper tackles the energy bottlenecks of large language models by co-designing a MatMul-free LLM for Intel Loihi 2, a neuromorphic processor optimized for low-precision, event-driven computation. It presents a 370M parameter architecture that replaces MatMuls with ternary weights, BitLinear layers, RMSNorm, GLU, and an MLGRU token mixer within a Metaformer framework, leveraging weight sparsity. Through hardware-aware quantization to 8-bit weights and 24-bit activations, fixed-point implementations, and a Loihi-specific microcode mapping with layer fusion, the authors demonstrate on-chip execution with energy-efficient inference. Preliminary results indicate up to 3x higher throughput and around 2x lower energy per token on Loihi 2 compared to transformer baselines on edge GPUs, with latency advantages and favorable scaling for longer sequences, highlighting the potential of neuromorphic hardware for efficient edge AI and scalable reasoning models.
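
To make the "replaces MatMuls with ternary weights" idea concrete, here is a minimal sketch in plain Python. It assumes an absmean-style ternarization rule (scale by the mean absolute weight, round, clip to {-1, 0, +1}); that rule, and the function names `ternarize` and `ternary_matvec`, are illustrative assumptions and are not taken from the paper or from any Loihi 2 API. The point is only that a matrix-vector product with ternary weights reduces to additions and subtractions, and that zero weights can be skipped entirely.

```python
def ternarize(weights, eps=1e-8):
    """Map a float weight matrix to {-1, 0, +1} plus a single scale.

    Assumption: absmean scaling (divide by the mean absolute weight),
    then round and clip. This is an illustrative rule, not the paper's code.
    """
    flat = [abs(w) for row in weights for w in row]
    scale = sum(flat) / len(flat) + eps
    ternary = [[max(-1, min(1, round(w / scale))) for w in row] for row in weights]
    return ternary, scale


def ternary_matvec(ternary, scale, x):
    """Compute y = scale * (T @ x) using only additions and subtractions."""
    out = []
    for row in ternary:
        acc = 0.0
        for t, xi in zip(row, x):
            if t == 1:
                acc += xi
            elif t == -1:
                acc -= xi
            # t == 0 contributes nothing, so sparse weights cost no work
        out.append(acc * scale)
    return out


if __name__ == "__main__":
    # Hypothetical toy weights and input, chosen only for illustration.
    w = [[0.9, -0.05, -1.1], [0.02, 1.0, -0.9]]
    t, s = ternarize(w)
    print(t)
    print(ternary_matvec(t, s, [2.0, 3.0, 1.0]))
```

The single multiplication by `scale` per output is the only multiply left; everything inside the inner loop is an add, a subtract, or a skip, which is the property that low-precision, event-driven hardware can exploit.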

Abstract

Large language models (LLMs) deliver impressive performance but require large amounts of energy. In this work, we present a MatMul-free LLM architecture adapted for Intel's neuromorphic processor, Loihi 2. Our approach leverages Loihi 2's support for low-precision, event-driven computation and stateful processing. Our hardware-aware quantized model on GPU demonstrates that a 370M parameter MatMul-free model can be quantized with no accuracy loss. Based on preliminary results, we report up to 3x higher throughput with 2x less energy, compared to transformer-based LLMs on an edge GPU, with significantly better scaling. Further hardware optimizations will increase throughput and decrease energy consumption. These results show the potential of neuromorphic hardware for efficient inference and pave the way for efficient reasoning models capable of generating complex, long-form text rapidly and cost-effectively.

Paper Structure

This paper contains 23 sections, 11 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Model architecture of the MatMul-free language model from zhu_scalable_2024.
  • Figure 2: Different Loihi 2 systems are available to cover a wide range of applications from the edge to HPC with up to 1 billion neurons.
  • Figure 3: Different execution modes on Loihi 2 that optimize either throughput or latency. In the pipelined mode, a new data point is inserted in each time step to use all processing cores and maximize throughput, at the expense of latency because equal time bins $t_0=t_1=\ldots$ are enforced. In the fall-through mode, a new data point is only provided once the previous data point has been fully processed, which minimizes latency. Only a single neuronal layer is active at any step as data travels through the network. The time per step is thus minimized, because traffic is reduced and potentially more complex neuronal layers are not updated.
  • Figure 4: Power of one MatMul-free block on a single-chip Loihi 2 system.
  • Figure 5: Scaling of time per step (inversely proportional to throughput, see text), power per chip and energy per token, as more chips are utilized in a 32-chip Alia Point Loihi 2 system. Each block of the MatMul-free LLM is implemented on a single Loihi 2 chip.
  • ...and 2 more figures
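
The trade-off between the two execution modes in Figure 3 can be sketched with a toy timing model. This is a simplification under stated assumptions, not a measurement or the paper's method: in pipelined mode every step is assumed to take as long as the slowest layer (equal time bins), while in fall-through mode each step is assumed to take only as long as the single active layer. The function names and the layer times are hypothetical.

```python
def pipelined(layer_times):
    """Toy model: all layers run every step, so the slowest layer sets the bin."""
    step = max(layer_times)              # assumed equal time bin t_0 = t_1 = ...
    latency = step * len(layer_times)    # one input crosses every layer, one bin each
    throughput = 1.0 / step              # a new input enters on every step
    return latency, throughput


def fall_through(layer_times):
    """Toy model: only one layer is active per step, so steps vary in length."""
    latency = sum(layer_times)           # each step costs only the active layer
    throughput = 1.0 / latency           # the next input waits for the previous one
    return latency, throughput


if __name__ == "__main__":
    # Hypothetical per-layer step times in arbitrary units.
    times = [1.0, 4.0, 1.0]
    print("pipelined   :", pipelined(times))
    print("fall-through:", fall_through(times))
```

Under these assumptions the pipelined mode never has lower latency than fall-through, and fall-through never has higher throughput than pipelined, which matches the qualitative description in the caption.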