Explore Activation Sparsity in Recurrent LLMs for Energy-Efficient Neuromorphic Computing
Ivan Knunyants, Maryam Tavakol, Manolis Sifalakis, Yingfu Xu, Amirreza Yousefzadeh, Guangzhi Tang
TL;DR
This work tackles the challenge of deploying LLMs on resource-constrained devices by introducing activation sparsity in recurrent LLMs (R-LLMs) through a training-free thresholding mechanism. An event-based R-LLM is augmented with per-layer activation thresholds, and a sequential threshold initialization algorithm selects sparsity levels using a small dataset, achieving up to $63\%$ average sparsity with modest performance loss. The approach generalizes to self-attention LLMs such as OPT, where it matches the effectiveness of training-based fine-tuning while making the threshold search substantially cheaper in GPU resources. Hardware simulations on the SENECA neuromorphic processor show energy and latency improvements of up to $1.9\times$, enabling low-power, real-time neuromorphic deployment of LLMs. Overall, the paper demonstrates a practical path for on-device LLM adaptation and neuromorphic deployment via training-free, activation-sparsity-driven acceleration.
Abstract
The recent rise of Large Language Models (LLMs) has revolutionized the deep learning field. However, the desire to deploy LLMs on edge devices introduces energy efficiency and latency challenges. Recurrent LLM (R-LLM) architectures have proven effective in mitigating the quadratic complexity of self-attention, making them a potential paradigm for computing on edge neuromorphic processors. In this work, we propose a low-cost, training-free algorithm to sparsify R-LLMs' activations to enhance energy efficiency on neuromorphic hardware. Our approach capitalizes on the inherent structure of these models, rendering them well-suited for energy-constrained environments. Although primarily designed for R-LLMs, this method can be generalized to other LLM architectures, such as transformers, as demonstrated on the OPT model, achieving comparable sparsity and efficiency improvements. Empirical studies illustrate that our method significantly reduces computational demands while maintaining competitive accuracy across multiple zero-shot learning benchmarks. Additionally, hardware simulations with the SENECA neuromorphic processor underscore notable energy savings and latency improvements. These results pave the way for low-power, real-time neuromorphic deployment of LLMs and demonstrate the feasibility of training-free on-chip adaptation using activation sparsity.
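
To make the idea of training-free, per-layer activation thresholding concrete, the following is a minimal illustrative sketch rather than the paper's algorithm. It assumes a toy stack of dense `tanh` layers in place of an R-LLM, and it assumes each threshold is set to a quantile of the activation magnitudes observed on a small calibration set, with layers processed front to back so that each layer sees the already-sparsified output of the previous one. The names `apply_threshold`, `sparsity`, `init_thresholds_sequentially`, and `target_sparsity` are hypothetical; the summary above does not specify the actual selection criterion (for example, how performance loss is taken into account).

```python
import numpy as np


def apply_threshold(x, threshold):
    """Zero out activations whose magnitude is below the threshold."""
    return np.where(np.abs(x) >= threshold, x, 0.0)


def sparsity(x):
    """Fraction of entries that are exactly zero."""
    return float(np.mean(x == 0.0))


def init_thresholds_sequentially(weights, calib, target_sparsity):
    """Pick one threshold per layer, front to back (illustrative assumption).

    Each threshold is the `target_sparsity` quantile of |activation| on the
    calibration batch, computed with all upstream layers already thresholded.
    No gradients or weight updates are involved.
    """
    thresholds = []
    h = calib
    for w in weights:
        pre = np.tanh(h @ w)  # toy layer standing in for a model block
        t = float(np.quantile(np.abs(pre), target_sparsity))
        thresholds.append(t)
        h = apply_threshold(pre, t)  # downstream layers see sparse input
    return thresholds, h


# Toy setup: three random 16x16 layers and a small calibration batch.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((16, 16)) / 4.0 for _ in range(3)]
calib = rng.standard_normal((256, 16))

thresholds, out = init_thresholds_sequentially(weights, calib, target_sparsity=0.6)
print("thresholds:", thresholds)
print("output sparsity:", sparsity(out))
```

In this sketch the sparsity level is fixed in advance and the thresholds follow from it. A scheme that also bounds performance loss would need an additional evaluation step per layer, which is omitted here.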
