Scalable MatMul-free Language Modeling

Rui-Jie Zhu; Yu Zhang; Steven Abreu; Ethan Sifferman; Tyler Sheaves; Yiqiao Wang; Dustin Richmond; Sumit Bam Shrestha; Peng Zhou; Jason K. Eshraghian

TL;DR

This work introduces a scalable MatMul-free language model. Dense MatMul operations are replaced by ternary BitLinear layers, token mixing is handled by a MatMul-free Linear Gated Recurrent Unit (MLGRU), and channel mixing is handled by a GLU-based block. With MatMul removed from both the dense layers and the attention-like components, the model stays competitive at billion-parameter scale while cutting memory and compute: the GPU implementation uses up to 61% less memory during training and over 10x less during inference. The authors also map the model to Intel Loihi 2 neuromorphic hardware, where it reaches 4x higher throughput with 10x less energy than edge GPUs. Together, these results point to a practical route toward brain-inspired efficiency and neuromorphic deployment for large-scale language models.
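To make the ternary idea concrete, here is a minimal pure-Python sketch of why a dense layer with weights restricted to -1, 0, and +1 needs no multiplications in its inner loop. It is an illustration under stated assumptions, not the authors' BitLinear implementation: the function names `ternary_quantize` and `ternary_linear` are hypothetical, the scale-by-mean-absolute-value rounding rule is assumed, and activation quantization and normalization are left out.

```python
def ternary_quantize(weights):
    """Map each weight to -1, 0, or +1 (assumed rule: scale by the mean |w|, round, clamp)."""
    flat = [abs(w) for row in weights for w in row]
    scale = sum(flat) / len(flat) or 1.0  # fall back to 1.0 if every weight is zero
    ternary = [[max(-1, min(1, round(w / scale))) for w in row] for row in weights]
    return ternary, scale


def ternary_linear(x, ternary, scale):
    """Compute x @ (scale * ternary) with only adds and subtracts in the inner loop."""
    y = [0.0] * len(ternary[0])
    for i, xi in enumerate(x):
        for j, t in enumerate(ternary[i]):
            if t == 1:
                y[j] += xi      # weight +1: accumulate the input
            elif t == -1:
                y[j] -= xi      # weight -1: subtract the input
            # weight 0: skip the input entirely
    return [scale * v for v in y]  # one rescale per output, not one per weight


# Hypothetical 3-input, 2-output weight matrix (rows index inputs).
weights = [[0.9, -0.1], [-1.2, 0.8], [0.05, 1.1]]
ternary, scale = ternary_quantize(weights)
outputs = ternary_linear([2.0, 3.0, 4.0], ternary, scale)
```

For these toy weights the quantized matrix is `[[1, 0], [-1, 1], [0, 1]]`, so the first output is `2.0 - 3.0` and the second is `3.0 + 4.0`, each rescaled once at the end.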

Abstract

Large Language Models (LLMs) have fundamentally altered how we approach scaling in machine learning. However, these models pose substantial computational and memory challenges, primarily due to their reliance on matrix multiplication (MatMul) within their attention and feed-forward (FFN) layers. We demonstrate that MatMul operations can be eliminated from LLMs while maintaining strong performance, even at billion-parameter scales. Our MatMul-free models, tested at scales up to 2.7B parameters, perform comparably to state-of-the-art pre-trained Transformers, and the performance gap narrows as model size increases. Our approach yields significant memory savings: a GPU-efficient implementation reduces memory consumption by up to 61% during training and by over 10x during inference. When adapted for a multi-chip neuromorphic system, the model leverages asynchronous processing to achieve 4x higher throughput with 10x less energy than edge GPUs.

Paper Structure (47 sections, 27 equations, 9 figures, 6 tables)

Main
Building the MatMul-free Language Model
Scaling Analysis
Performance on Downstream Tasks
Deployment on Neuromorphic Hardware
Training Efficiency Optimization on GPU
Inference Efficiency Optimization on GPU
Neuromorphic Computing with Intel Loihi 2
Methods
MatMul-free Dense Layers with Ternary Weights
Hardware-efficient Fused BitLinear Layer
MatMul-free Language Model Architecture
MatMul-free Token Mixer
Revisiting the Gated Recurrent Unit
MatMul-free Linear Gated Recurrent Unit
...and 32 more sections

Figures (9)

Figure 1: Overview of the MatMul-free LM. Left: general architecture of the proposed model. Middle-right: algorithm mapping of a single block across neurocores on a single Loihi 2 chip. Top-right: multi-chip dataflow for autoregression and pre-fill. During autoregression, only one chip dissipates non-negligible dynamic power because the system is clock-free. Bottom-right: the MatMul-free LM is deployed on the Hala Point system, which consists of 1,152 Loihi 2 chips.
Figure 2: Scaling law comparison between MatMul-free LM and Transformer++ models, depicted through their loss curves. The red lines represent the loss trajectories of the MatMul-free LM, while the blue lines indicate the losses of the Transformer++ models. The star marks the intersection point of the scaling law projection for both model types. MatMul-free LM uses ternary parameters and BF16 activations, whereas Transformer++ uses BF16 parameters and activations.
Figure 3: Performance comparison and analysis of different models and configurations. (a) and (b) show the training performance comparison between Vanilla BitLinear and Fused BitLinear in terms of time and memory consumption as a function of batch size. (c) compares the inference memory consumption and latency between MatMul-free LM and Transformer++ across various model sizes.
Figure 4: Training loss over steps for the MatMul-free Transformer++ and our proposed method at the 370M-parameter scale. The MatMul-free Transformer++ fails to converge, while our method converges under the MatMul-free setting.
Figure 5: RTL implementation for running MatMul-free token generation
...and 4 more figures
