Scalable MatMul-free Language Modeling
Rui-Jie Zhu, Yu Zhang, Steven Abreu, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Sumit Bam Shrestha, Peng Zhou, Jason K. Eshraghian
TL;DR
This work introduces a scalable MatMul-free language model that replaces dense MatMul operations with ternary BitLinear layers, a token mixer built around a MatMul-free Linear Gated Recurrent Unit (MLGRU), and a GLU-based channel mixer. By eliminating MatMul from both the dense layers and the attention-like token mixing, the model maintains strong performance at billion-parameter scales while cutting memory use by up to 61% during training and by more than 10x during inference in a GPU-efficient implementation. The authors also map the model to Intel Loihi 2 neuromorphic hardware, where it reaches 4x higher throughput with 10x less energy than edge GPUs. Together, these results suggest a practical route to brain-inspired efficiency and neuromorphic deployment for large-scale language models.
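To make the idea of a ternary layer concrete, the following is a minimal NumPy sketch of how a weight matrix restricted to {-1, 0, +1} lets a dense product be computed with additions and subtractions only. It is an illustration under stated assumptions, not the authors' implementation: the function names `ternary_quantize` and `ternary_matvec` are hypothetical, and the absmean rounding rule is assumed here for illustration, since this summary does not specify the quantizer.

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Map a float weight matrix to {-1, 0, +1} plus one per-tensor scale.

    Assumption: absmean scaling followed by round-and-clip; the summary
    above does not state which quantizer the model uses.
    """
    scale = np.mean(np.abs(w)) + eps
    w_t = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return w_t, scale

def ternary_matvec(x, w_t):
    """Compute x @ w_t without any multiplications.

    For each output column, add the inputs whose weight is +1 and
    subtract the inputs whose weight is -1; zero weights are skipped.
    """
    out = np.zeros(w_t.shape[1], dtype=x.dtype)
    for j in range(w_t.shape[1]):
        col = w_t[:, j]
        out[j] = x[col == 1].sum() - x[col == -1].sum()
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w = rng.standard_normal((8, 4))

w_t, scale = ternary_quantize(w)
y = ternary_matvec(x, w_t)

# The add/subtract result matches the ordinary matrix product with w_t.
assert np.allclose(y, x @ w_t)
```

The explicit Python loop is for readability only; an efficient version would fuse these accumulations into a custom kernel. Multiplying `y` by `scale` would give an approximation of `x @ w` in the original weight range.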
Abstract
Large Language Models (LLMs) have fundamentally altered how we approach scaling in machine learning. However, these models pose substantial computational and memory challenges, primarily due to the reliance on matrix multiplication (MatMul) within their attention and feed-forward (FFN) layers. We demonstrate that MatMul operations can be eliminated from LLMs while maintaining strong performance, even at billion-parameter scales. Our MatMul-free models, tested on models up to 2.7B parameters, are comparable to state-of-the-art pre-trained Transformers, and the performance gap narrows as model size increases. Our approach yields significant memory savings: a GPU-efficient implementation reduces memory consumption by up to 61% during training and over 10x during inference. When adapted for a multi-chip neuromorphic system, the model leverages asynchronous processing to achieve 4x higher throughput with 10x less energy than edge GPUs.
