Post-Training Statistical Calibration for Higher Activation Sparsity

Vui Seng Chua, Yujie Pan, Nilesh Jain

TL;DR

This work tackles activation sparsity in large language models by moving beyond post-activation pruning on ReLU-based paths to a generalized, post-training scheme that prunes the input activations of all FC layers in Transformer blocks. It introduces Statistical Calibrated Activation Pruning (SCAP), which uses Mode-Centering to align activation distributions for more effective $L_{1}$-thresholding, together with a unified SCAP_FC kernel that avoids sparsity predictors. Empirically, SCAP delivers a better Pareto frontier than CATS (e.g., up to $48.5\%$ FFN sparsity at an accuracy change of only $-1.5\%$) and up to $1.5\times$ additional decoding speedup over CATS across several model families, including Mistral-7B and Llama-2-7B. It also extends to MoE, Mamba2, and Vision Transformers without retraining. The code is open source, and the method targets faster, more affordable deployment of large models on standard hardware.

Abstract

We present Statistical Calibrated Activation Pruning (SCAP), a post-training activation pruning framework that (1) generalizes sparsification by input activations of Fully-Connected layers for generic and flexible application across Transformers, and (2) features a simple Mode-Centering technique to pre-calibrate activation distributions for maximizing post-training sparsity. Our results demonstrate robust Pareto efficiency compared to prior methods, translating to a 1.5x additional LLM decoding speedup against CATS at iso model quality. SCAP effectiveness is empirically verified across a wide range of models, including recent Transformer Decoders, MoE, Mamba2, Encoding Transformer, and pre-quantized models, highlighting its practicality and scalability. The code is available at: https://github.com/IntelLabs/SCAP.
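The two ideas in the abstract, pruning the input activations of an FC layer and Mode-Centering before thresholding, can be illustrated with a small NumPy sketch. This is not the released SCAP code: the function names, the histogram-based scalar mode estimate, and the quantile-based threshold are assumptions made for illustration only. The sketch shifts the input by an estimated mode, zeroes entries whose centered magnitude falls below a calibrated threshold, and folds the mode back into the bias so that the layer output is unchanged when nothing is pruned.

```python
import numpy as np


def estimate_mode(samples, bins=256):
    """Hypothetical helper: scalar mode taken as the center of the fullest histogram bin."""
    counts, edges = np.histogram(samples, bins=bins)
    k = int(np.argmax(counts))
    return 0.5 * (edges[k] + edges[k + 1])


def calibrate(calib_x, target_sparsity):
    """Pick a mode and a magnitude threshold from calibration activations.

    Assumption: the threshold is the `target_sparsity` quantile of the
    absolute mode-centered activations, so roughly that fraction of
    entries falls at or below it.
    """
    mode = estimate_mode(calib_x.ravel())
    threshold = np.quantile(np.abs(calib_x - mode), target_sparsity)
    return mode, threshold


def scap_fc_sketch(x, weight, bias, mode, threshold):
    """FC layer with mode-centered, magnitude-pruned input.

    x: (batch, in_features), weight: (out_features, in_features).
    Returns the layer output and the achieved input sparsity.
    """
    centered = x - mode
    mask = np.abs(centered) > threshold        # keep only large centered entries
    sparse = np.where(mask, centered, 0.0)
    # W @ (mode * 1) is a constant vector, so it can be folded into the bias.
    folded_bias = bias + mode * weight.sum(axis=1)
    return sparse @ weight.T + folded_bias, 1.0 - mask.mean()
```

With `threshold=0.0` the sketch reproduces the dense output `x @ weight.T + bias` up to floating-point error, which is a quick check that the bias folding is consistent. A real kernel would skip the weight columns of zeroed inputs rather than multiplying by zeros as done here.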

Paper Structure

This paper contains 18 sections, 6 equations, 11 figures, 6 tables, 3 algorithms.

Figures (11)

  • Figure 1: Declining usage of ReLU in recent LLMs
  • Figure 2: Activation Sparsification across methods on SwiGLU
  • Figure 3:
  • Figure 7: Effect of Mode-Centering Calibration on Activation Sparsity
  • Figure 8: Computational graph of an FC layer with mode-centered and pruned input activation
  • ...and 6 more figures