SVD Contextual Sparsity Predictors for Fast LLM Inference

Georgii Serbin, Kirill Koshkin, Zhongao Sun, Anastasiya Bistrigova, C. C. Korikov

Abstract

Contextual sparsity is one approach to reducing the computational cost of large language model (LLM) inference. Existing contextual-sparsity techniques that accelerate LLM inference with minimal accuracy degradation require training sparse pattern predictors. This paper presents a framework for accelerating inference of ReGLU-based feed-forward networks (FFNs) within LLMs. The proposed framework provides a fast, training-free method for building sparse pattern predictors using truncation-aware singular value decomposition (SVD) of the gate projection matrix, together with a threshold calibration algorithm and inference executors that support conditional computation on CUDA and CANN devices. Experiments on three sparse LLMs with an average FFN activation sparsity of 90% demonstrate up to a 1.8x reduction in end-to-end decoding time while keeping benchmark score degradation below 1% on tasks involving complex math and code generation. This work advances the deployment of LLMs on edge devices.
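The predictor construction described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions: it uses plain truncated SVD rather than the truncation-aware variant the paper proposes (Figure 5 refers to the plain variant as naive SVD), random weights stand in for a real gate projection, and thresholds are fixed at zero. The function names `build_svd_predictor` and `predict_active` and all dimensions are hypothetical and not taken from the paper.

```python
import numpy as np

def build_svd_predictor(w_gate, rank):
    """Factor the gate projection (d_model x d_ff) into two thin matrices.

    Assumption: plain truncated SVD, not the paper's truncation-aware SVD.
    """
    u, s, vt = np.linalg.svd(w_gate, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # (d_model, rank), singular values folded in
    b = vt[:rank, :]             # (rank, d_ff)
    return a, b

def predict_active(x, a, b, thresholds=0.0):
    """Boolean mask of neurons predicted to have a positive gate pre-activation."""
    scores = (x @ a) @ b         # low-rank estimate of x @ w_gate
    return scores > thresholds

# Toy setup with hypothetical sizes; a real model would supply w_gate.
rng = np.random.default_rng(0)
d_model, d_ff, rank = 64, 256, 16
w_gate = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_model)
x = rng.standard_normal(d_model)

a, b = build_svd_predictor(w_gate, rank)
mask = predict_active(x, a, b)

# Compare with the exact ReLU activity pattern of the gate projection.
true_active = (x @ w_gate) > 0
agreement = float(np.mean(mask == true_active))
print(f"predicted active: {int(mask.sum())}/{d_ff}, agreement: {agreement:.2f}")
```

The predictor costs roughly `d_model * rank + rank * d_ff` multiply-adds per token instead of `d_model * d_ff` for the full gate projection, which is where the potential saving comes from when `rank` is small. Random Gaussian weights are not low-rank, so the agreement printed here says nothing about predictor quality on a real model.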

Paper Structure (47 sections, 8 theorems, 46 equations, 7 figures, 12 tables, 3 algorithms)

Key Result

Proposition 1.2

For every $x \in \mathbb{R}^d$, and therefore

Figures (7)

  • Figure 1: Overview of our framework. During the offline stage, sparsity predictors are constructed using low-rank SVD approximation with calibrated per-neuron thresholds. At inference time, SVD-based predictors precede each FFN block to exploit contextual sparsity and accelerate generation.
  • Figure 2: Distribution of SVD-based predictor outputs for a sample neuron. Although active and inactive states are well separated, zero thresholding causes significant false negatives for truly active neurons.
  • Figure 3: Sparse inference using SVD-based sparsity predictors. Neurons with positive predictor values are considered active during gate projection. Gate and up projections are executed sequentially to increase sparsity.
  • Figure 4: Layer-wise AUROC of different predictors, averaged across several sequence samples, for ProSparse-LLaMA2-7B. An AUROC of 1 indicates perfect separation; an AUROC of 0.5 corresponds to random guessing.
  • Figure 5: Layer-wise AUROC for ProSparse-LLaMA2-7B: data-aware SVD vs. naive SVD.
  • ...and 2 more figures
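
The captions of Figures 1 to 3 describe per-neuron thresholds and a sparse FFN pass in which the gate projection runs before the up projection. The sketch below illustrates that flow under stated assumptions. The calibration rule (choose each threshold so that a target fraction of truly active calibration samples is retained) is a hypothetical stand-in, since the paper's calibration algorithm is not given in this summary. The predictor uses plain truncated SVD, and all names, sizes, and the `recall` parameter are illustrative.

```python
import numpy as np

def calibrate_thresholds(pred_scores, true_preacts, recall=0.99):
    """Hypothetical per-neuron calibration: pick the largest threshold that
    keeps at least `recall` of the calibration samples where the neuron is
    truly active (true gate pre-activation > 0)."""
    d_ff = pred_scores.shape[1]
    thresholds = np.zeros(d_ff)          # fallback for never-active neurons
    for j in range(d_ff):
        active = true_preacts[:, j] > 0
        if active.any():
            s = np.sort(pred_scores[active, j])
            k = min(int(np.floor((1.0 - recall) * s.size)), s.size - 1)
            thresholds[j] = s[k]         # at most k active samples fall below
    return thresholds

def sparse_reglu_ffn(x, w_gate, w_up, w_down, a, b, thresholds):
    """ReGLU FFN that only touches neurons the predictor marks as active."""
    idx = np.flatnonzero((x @ a) @ b >= thresholds)
    gate = np.maximum(x @ w_gate[:, idx], 0.0)
    # Gate runs first: neurons whose exact gate is zero are dropped before
    # the up projection, which increases the effective sparsity.
    keep = gate > 0
    idx, gate = idx[keep], gate[keep]
    up = x @ w_up[:, idx]
    return (gate * up) @ w_down[idx, :]

# Toy weights with hypothetical sizes.
rng = np.random.default_rng(0)
d_model, d_ff, rank, n_calib = 32, 128, 8, 400
w_gate = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_model)
w_up = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_model)
w_down = rng.standard_normal((d_ff, d_model)) / np.sqrt(d_ff)

# Plain truncated SVD predictor (assumption, not the paper's variant).
u, s, vt = np.linalg.svd(w_gate, full_matrices=False)
a, b = u[:, :rank] * s[:rank], vt[:rank, :]

# Calibration pass over random inputs standing in for real activations.
calib_x = rng.standard_normal((n_calib, d_model))
pred_scores = (calib_x @ a) @ b
true_preacts = calib_x @ w_gate
thresholds = calibrate_thresholds(pred_scores, true_preacts, recall=0.99)

x = rng.standard_normal(d_model)
y_sparse = sparse_reglu_ffn(x, w_gate, w_up, w_down, a, b, thresholds)
y_dense = (np.maximum(x @ w_gate, 0.0) * (x @ w_up)) @ w_down
print("relative error:", np.linalg.norm(y_sparse - y_dense) / np.linalg.norm(y_dense))
```

Setting every threshold to negative infinity makes the predictor select all neurons, in which case the sparse pass reproduces the dense output. Column indexing with NumPy only mimics conditional computation; an actual speedup depends on device kernels such as the CUDA and CANN executors the abstract mentions.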

Theorems & Definitions (22)

  • Definition 1.1: Residual operator
  • Proposition 1.2: Exact representation of the gating error
  • proof
  • Proposition 1.3: Monotonicity w.r.t. biases
  • proof
  • Remark 1.4
  • Lemma 1.5: Residual-based upper bound
  • proof
  • Theorem 1.6: Worst-case bound for uniform bias
  • proof
  • ...and 12 more