SVD Contextual Sparsity Predictors for Fast LLM Inference

Georgii Serbin; Kirill Koshkin; Zhongao Sun; Anastasiya Bistrigova; C. C. Korikov

Abstract

Contextual sparsity is one approach to reducing the computational cost of large language model (LLM) inference. Existing contextual-sparsity techniques that accelerate inference with minimal accuracy degradation require training sparse pattern predictors. This paper presents a framework for accelerating inference of ReGLU-based feed-forward networks (FFNs) within LLMs. The framework provides a fast, training-free method for building sparse pattern predictors using truncation-aware singular value decomposition (SVD) of the gate projection matrix, together with a threshold calibration algorithm and inference executors that support conditional computation on CUDA and CANN devices. Experiments on three sparse LLMs with an average FFN activation sparsity of 90% demonstrate up to a 1.8x reduction in end-to-end decoding time while keeping benchmark score degradation below 1% on complex math and code generation tasks. This work advances the deployment of LLMs on edge devices.
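To make the core idea in the abstract concrete, the following is a minimal NumPy sketch of a training-free predictor built from a low-rank factorization of the gate projection matrix. It uses a plain truncated SVD, not the truncation-aware (data-aware) variant the paper describes, and the names `build_svd_predictor` and `predict_active`, the matrix shapes, and the random weights are all assumptions made for illustration.

```python
import numpy as np

def build_svd_predictor(w_gate, rank):
    """Factor w_gate (n_neurons, d_model) into A (n_neurons, rank) and B (rank, d_model).

    Plain truncated SVD; a stand-in for the paper's data-aware factorization.
    """
    u, s, vt = np.linalg.svd(w_gate, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # fold singular values into the left factor
    b = vt[:rank, :]
    return a, b

def predict_active(a, b, x, threshold=0.0):
    """Predict which neurons have a positive gate pre-activation for input x."""
    scores = a @ (b @ x)         # low-rank estimate of w_gate @ x
    return scores > threshold

# Hypothetical toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
d_model, n_neurons, rank = 64, 256, 16
w_gate = rng.standard_normal((n_neurons, d_model))
x = rng.standard_normal(d_model)

a, b = build_svd_predictor(w_gate, rank)
mask = predict_active(a, b, x)
true_mask = (w_gate @ x) > 0
agreement = float(np.mean(mask == true_mask))  # fraction of neurons classified correctly
```

The predictor costs roughly `rank * (d_model + n_neurons)` multiply-adds per token instead of `d_model * n_neurons` for the full gate projection, which is where the potential saving comes from when `rank` is small.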

Paper Structure (47 sections, 8 theorems, 46 equations, 7 figures, 12 tables, 3 algorithms)

Introduction
Related Work
Preliminaries
SVD-Based Sparsity Predictors
Training-free Predictor Building
Data-Aware SVD-Based Factorization
Threshold Calibration
Runtime Execution
Sparsity Prediction
Gate/up Execution Order
Algorithmic Complexity
Hardware platforms
Experimental Results
Experimental Setup
Hardware
...and 32 more sections

Key Result

Proposition 1.2

For every $x \in \mathbb{R}^d$, and therefore

Figures (7)

Figure 1: Overview of our framework. During the offline stage, sparsity predictors are constructed using low-rank SVD approximation with calibrated per-neuron thresholds. At inference time, SVD-based predictors precede each FFN block to exploit contextual sparsity and accelerate generation.
Figure 2: Distribution of SVD-based predictor outputs for a sample neuron. Although active and inactive states are well separated, zero thresholding causes significant false negatives for truly active neurons.
Figure 3: Sparse inference using SVD-based sparsity predictors. Neurons with positive predictor values are considered active during gate projection. Gate and up projections are executed sequentially to increase sparsity.
Figure 4: Layer-wise ROC AUC of different predictors for ProSparse-LLaMA2-7B, averaged across several sequence samples. A ROC AUC of 1 indicates perfect separation; a ROC AUC of 0.5 corresponds to random guessing.
Figure 5: Layer-wise ROC AUC of data-aware SVD vs. naive SVD for ProSparse-LLaMA2-7B.
...and 2 more figures
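The captions of Figures 2 and 3 describe two runtime ingredients: per-neuron thresholds that reduce false negatives relative to a zero threshold, and sequential gate-then-up execution on the predicted-active neurons. The sketch below illustrates both under stated assumptions. The quantile rule in `calibrate_thresholds` is a hypothetical stand-in, not the paper's calibration algorithm, and `sparse_reglu_ffn`, the weight shapes, and the random data are likewise illustrative.

```python
import numpy as np

def calibrate_thresholds(scores, gate_pre, miss_rate=0.01):
    """Per-neuron thresholds from calibration data (hypothetical quantile rule).

    scores, gate_pre: (n_samples, n_neurons) predictor outputs and true gate
    pre-activations. Each threshold is the miss_rate quantile of the predictor
    scores observed while the neuron was truly active.
    """
    n_neurons = scores.shape[1]
    thresholds = np.zeros(n_neurons)  # neurons never active keep threshold 0
    for j in range(n_neurons):
        active_scores = scores[gate_pre[:, j] > 0, j]
        if active_scores.size:
            thresholds[j] = np.quantile(active_scores, miss_rate)
    return thresholds

def sparse_reglu_ffn(x, w_gate, w_up, w_down, a, b, thresholds):
    """ReGLU FFN computed only on neurons the predictor marks as active."""
    idx = np.flatnonzero(a @ (b @ x) > thresholds)  # predicted-active neurons
    gate = np.maximum(w_gate[idx] @ x, 0.0)         # gate projection on that subset
    keep = gate > 0                                 # drop neurons the gate zeroes out
    idx, gate = idx[keep], gate[keep]
    hidden = gate * (w_up[idx] @ x)                 # up projection on the smaller set
    return w_down[:, idx] @ hidden

# Hypothetical toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
d_model, n_neurons, rank, n_calib = 32, 128, 8, 512
w_gate = rng.standard_normal((n_neurons, d_model))
w_up = rng.standard_normal((n_neurons, d_model))
w_down = rng.standard_normal((d_model, n_neurons))

u, s, vt = np.linalg.svd(w_gate, full_matrices=False)
a, b = u[:, :rank] * s[:rank], vt[:rank]            # plain truncated SVD factors

calib_x = rng.standard_normal((n_calib, d_model))
scores = calib_x @ b.T @ a.T                        # predictor outputs per sample
gate_pre = calib_x @ w_gate.T                       # true gate pre-activations
thresholds = calibrate_thresholds(scores, gate_pre, miss_rate=0.01)

x = rng.standard_normal(d_model)
y_sparse = sparse_reglu_ffn(x, w_gate, w_up, w_down, a, b, thresholds)
y_dense = w_down @ (np.maximum(w_gate @ x, 0.0) * (w_up @ x))
```

Running the gate projection first means the up projection only touches neurons whose gate output is nonzero, which is one reading of why sequential execution increases the effective sparsity. With random Gaussian weights the activations are not sparse, so `y_sparse` and `y_dense` are meant to show the data flow, not the accuracy or speed reported in the paper.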

Theorems & Definitions (22)

Definition 1.1: Residual operator
Proposition 1.2: Exact representation of the gating error
proof
Proposition 1.3: Monotonicity w.r.t. biases
proof
Remark 1.4
Lemma 1.5: Residual-based upper bound
proof
Theorem 1.6: Worst-case bound for uniform bias
proof
...and 12 more
