SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference
Jiho Shin, Hoeseok Yang, Youngmin Yi
TL;DR
SparseInfer introduces a training-free, sign-bit-based predictor that exploits activation sparsity in ReLU-fied LLMs, removing the need for a learned sparsity predictor and enabling hardware-agnostic acceleration. It predicts which rows will produce a zero activation by XORing the sign bits (MSBs) of the inputs and weights, with a tunable conservativeness parameter $\alpha$, and achieves substantial end-to-end speedups with negligible accuracy degradation. The method is implemented in CUDA with kernel fusion and a dedicated sparse GEMV, and demonstrates speedups of approximately $1.6$–$1.8\times$ over a strong llama.cpp baseline and up to $1.3\times$ over PowerInfer on Jetson Orin, while reducing memory usage by over $4\times$. These results point to a practical, training-free design space for deploying faster, activation-sparsity-aware LLM inference across heterogeneous hardware.
Abstract
Leveraging sparsity is crucial for optimizing large language model inference. However, modern LLMs that use SiLU as their activation function exhibit minimal activation sparsity. Recent research has proposed replacing SiLU with ReLU to induce significant activation sparsity and has shown that, with fine-tuning, this causes no degradation in downstream task accuracy. However, taking full advantage of this sparsity has required training a predictor to estimate it. In this paper, we introduce SparseInfer, a simple, lightweight, and training-free predictor for the activation sparsity of ReLU-fied LLMs, in which activation sparsity is predicted by comparing only the sign bits of inputs and weights. To compensate for possible prediction inaccuracy, the predictor's conservativeness can be tuned adaptively, which also serves as a control knob for optimizing LLM inference. The proposed method achieves faster inference than the state of the art, with negligible accuracy loss of within 1%p.
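The sign-bit comparison described above can be illustrated with a minimal Python sketch. It is not the authors' CUDA implementation. The function names `predict_zero_rows` and `sparse_relu_gemv`, and the exact decision rule (skip a row when the count of negative-sign products exceeds `alpha` times the count of positive-sign products), are assumptions made here for illustration. Ordinary comparisons with zero stand in for reading the sign bit.

```python
def predict_zero_rows(x, W, alpha=1.0):
    """Predict, from signs only, which rows of W give a non-positive dot product with x.

    Assumed rule (illustrative): a row is predicted to be zero after ReLU when
    n_neg > alpha * n_pos, so a larger alpha skips fewer rows (more conservative).
    """
    x_neg = [v < 0 for v in x]          # stand-in for the sign bit of each input
    skip = []
    for row in W:
        n_neg = 0
        n_pos = 0
        for w, xs in zip(row, x_neg):
            if (w < 0) != xs:           # XOR of sign bits: the product is negative
                n_neg += 1
            else:
                n_pos += 1
        skip.append(n_neg > alpha * n_pos)
    return skip


def sparse_relu_gemv(x, W, alpha=1.0):
    """Compute ReLU(W @ x), skipping rows that the predictor marks as zero."""
    out = []
    for row, s in zip(W, predict_zero_rows(x, W, alpha)):
        if s:
            out.append(0.0)             # predicted sparse: no multiply-accumulate
        else:
            out.append(max(0.0, sum(w * v for w, v in zip(row, x))))
    return out


x = [1.0, -2.0, 3.0]
W = [
    [1.0, -1.0, 1.0],    # every product is positive
    [-1.0, 1.0, -1.0],   # every product is negative
    [-0.1, 1.0, 0.1],    # two negative products, one positive
]
print(predict_zero_rows(x, W))              # default alpha
print(predict_zero_rows(x, W, alpha=3.0))   # more conservative setting
print(sparse_relu_gemv(x, W))
```

In this sketch, raising `alpha` causes the third row to be computed rather than skipped, which is the sense in which the parameter trades speed for a lower risk of wrongly zeroing an activation.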
