Accelerating Transformer Inference and Training with 2:4 Activation Sparsity
Daniel Haziza, Timothy Chou, Dhruv Choudhary, Luca Wehrstedt, Francisco Massa, Jiecao Yu, Geonhwa Jeong, Supriya Rao, Patrick Labatut, Jesse Cai
TL;DR
The paper addresses the computational bottlenecks of LLM training and inference by applying 2:4 activation sparsity to Squared-ReLU FFNs, exploiting the sparsity intrinsic to these activations to accelerate computation without accuracy loss. It develops a practical FP8-based, three-kernel FFN path and introduces stability-focused optimizations, including a 95/5 feature split and token permutation, to make sparsity effective in the backward pass. Empirical results show minimal accuracy degradation in LLM pretraining and kernel-level FFN speedups of up to 1.3x across the forward and backward passes (about 30% on the forward pass), indicating a viable route to accelerating large-scale models. The work provides a framework for combining activation sparsity with hardware-accelerated 2:4 GEMMs, informing future sparse Transformer designs and larger-model deployments.
Abstract
In this paper, we demonstrate how to apply 2:4 sparsity, a popular hardware-accelerated GPU sparsity pattern, to activations in order to accelerate large language model training and inference. Crucially, we exploit the intrinsic sparsity found in Squared-ReLU activations to provide this acceleration with no accuracy loss. Our approach achieves up to 1.3x faster Feed Forward Networks (FFNs) in both the forward and backward passes. This work highlights the potential for sparsity to play a key role in accelerating large language model training and inference.
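To make the 2:4 pattern concrete, the following is a minimal NumPy sketch, not the paper's FP8 GPU kernels. It applies Squared-ReLU and then enforces 2:4 sparsity by zeroing the two smallest-magnitude entries in every contiguous group of four along the last axis. The function names `squared_relu` and `sparsify_2_4`, the magnitude-based selection rule, and the grouping along the feature axis are assumptions made for illustration.

```python
import numpy as np


def squared_relu(x):
    """Squared-ReLU: max(x, 0)^2, which is exactly zero for all non-positive inputs."""
    return np.maximum(x, 0.0) ** 2


def sparsify_2_4(a):
    """Keep at most 2 nonzeros in each contiguous group of 4 along the last axis.

    Assumption: the two smallest-magnitude entries of each group are dropped.
    """
    if a.shape[-1] % 4 != 0:
        raise ValueError("last dimension must be a multiple of 4")
    groups = a.reshape(-1, 4).copy()
    # Indices of the two smallest-magnitude entries in each group of four.
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(a.shape)


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))      # hypothetical pre-activation values
act = squared_relu(x)                 # many entries are already exactly zero
sparse_act = sparsify_2_4(act)        # every group of 4 now has <= 2 nonzeros

# Fraction of groups that already satisfied 2:4 before pruning.
already_ok = (np.count_nonzero(act.reshape(-1, 4), axis=1) <= 2).mean()
print(f"groups already 2:4 after Squared-ReLU: {already_ok:.2f}")
```

Groups that already contain at most two nonzeros pass through unchanged, which is the sense in which intrinsically sparse activations can fit the 2:4 pattern with little or no loss of information. On random Gaussian inputs, as here, only part of the groups qualify; the sketch does not reproduce the sparsity levels of a trained model.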
