Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity
Susav Shrestha, Brad Settlemyer, Nikoli Dryden, Narasimha Reddy
TL;DR
This work tackles the bottleneck of scaling LLM inference in batched, high-throughput settings by shifting focus from MLP activation sparsity to attention head sparsity as batch size and sequence length grow. It introduces Polar Sparsity, which combines dynamic MLP sparsity with Selective Head Attention and sparsity-aware GPU kernels (Selective GEMM and Selective FlashAttention) to enable scalable, low-IO, high-throughput decoding. The authors show that attention head sparsity remains batch-invariant and use this property to achieve up to $2.2\times$ end-to-end speedups on large models from the OPT and LLaMA families, with accuracy within ~1% of dense baselines. The reported kernel-level speedups are up to $5.5\times$ for sparse GEMM and $2.8\times$ for sparse FlashAttention. Because the approach requires only minimal changes to existing architectures, it is practical for large-scale LLM deployment, and it opens avenues for further adaptive, task-aware sparsity strategies.
Abstract
Accelerating large language model (LLM) inference is critical for real-world deployments requiring high throughput and low latency. Contextual sparsity, where each token dynamically activates only a small subset of the model parameters, shows promise but does not scale to large batch sizes because the union of active neurons across the batch quickly approaches dense computation. We introduce Polar Sparsity, highlighting a key shift in sparsity importance from MLP to Attention layers as we scale batch size and sequence length. While MLP layers become more compute-efficient under batching, their sparsity vanishes. In contrast, attention becomes increasingly expensive at scale, while its head sparsity remains stable and batch-invariant. We develop Selective Head Attention with hardware-efficient, sparsity-aware GPU kernels, delivering up to \(2.2\times\) end-to-end speedups for models such as OPT, LLaMA-2 and 3, Qwen, and Mistral across various batch sizes and sequence lengths without compromising accuracy. To our knowledge, this is the first work to demonstrate that contextual sparsity can scale effectively to large batch sizes, delivering substantial inference acceleration with minimal changes and making Polar Sparsity practical for large-scale, high-throughput LLM deployment systems. Our code is available at: https://github.com/susavlsh10/Polar-Sparsity.
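
The claim that MLP sparsity vanishes under batching can be illustrated with a small back-of-the-envelope sketch. It is not taken from the paper: it assumes, purely for illustration, that each token activates a fraction `p` of MLP neurons independently of the other tokens, so the expected union density over a batch of size `B` is `1 - (1 - p)**B`. The function names `union_density` and `head_density`, the value `p = 0.1`, and the choice of 8 active heads out of 32 are all hypothetical. Real activations are correlated, so the actual growth may be slower, but the trend toward dense computation is the same.

```python
def union_density(p: float, batch_size: int) -> float:
    """Expected fraction of MLP neurons active for at least one token.

    Assumption (illustrative only): every token activates a fraction p
    of the neurons, independently of the other tokens in the batch.
    """
    return 1.0 - (1.0 - p) ** batch_size


def head_density(active_heads: int, num_heads: int) -> float:
    """Fraction of attention heads computed per sequence.

    Assumption (illustrative only): each sequence keeps a fixed number of
    heads and attends over its own KV cache, so no union across the batch
    is taken and the density does not depend on batch size.
    """
    return active_heads / num_heads


if __name__ == "__main__":
    p = 0.1  # hypothetical per-token MLP activation rate
    for batch_size in (1, 8, 64, 256):
        mlp = union_density(p, batch_size)
        heads = head_density(active_heads=8, num_heads=32)
        print(f"B={batch_size:4d}  MLP union density={mlp:.3f}  head density={heads:.3f}")
```

Under these assumptions the MLP union density climbs toward 1 as the batch grows, while the per-sequence head density stays fixed, which is the asymmetry the abstract describes.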
