WiSparse: Boosting LLM Inference Efficiency with Weight-Aware Mixed Activation Sparsity

Lei Chen, Yuan Meng, Xiaoyu Zhan, Zhi Wang, Wenwu Zhu

TL;DR

WiSparse is a training-free activation-sparsification framework for LLM inference that scores channels using both activation magnitudes and weight importance. It pairs this weight-aware saliency score with a two-stage mixed-granularity allocation: a global sparsity budget is first distributed across blocks, then refined across the layers within each block. At 50% sparsity, WiSparse preserves about 97% of Llama3.1's dense performance, 2.23 percentage points above the strongest training-free baseline, with a 21.4% end-to-end inference speedup.

Abstract

Large Language Models (LLMs) offer strong capabilities but incur high inference costs due to dense computation and memory access. Training-free activation sparsity is a promising approach for efficient LLM inference, yet existing methods often rely solely on activation information and uniform sparsity ratios. This overlooks the critical interplay with weights and inter-block sensitivity variation, leading to suboptimal performance. We identify two key phenomena in modern LLMs: 1) less significant activations may align with highly important weights, and 2) sparsity sensitivity varies non-monotonically across model blocks. We propose Weight-aware Mixed-Granularity Training-free Activation Sparsity (WiSparse), which leverages both activation and weight information for adaptive sparsity allocation. Specifically, we introduce a weight-aware mechanism integrating activation magnitudes with precomputed weight norms to accurately identify salient channels. This is combined with a mixed-granularity allocation scheme: a global budget is distributed across blocks via evolutionary search to protect sensitive regions, then refined within blocks to minimize reconstruction error. We improve sparse kernels and demonstrate effectiveness on three representative models. Notably, at 50% sparsity, WiSparse preserves 97% of Llama3.1's dense performance, surpassing the strongest baseline by 2.23 percentage points while achieving a 21.4% acceleration in end-to-end inference speed. Our research advances the limits of training-free approaches for efficient LLM inference, pushing the boundaries of achievable speedup without training.
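As an illustration of the weight-aware idea described in the abstract, the sketch below scores each input channel by the product of its activation magnitude and a precomputed weight norm, then keeps the highest-scoring channels. The exact scoring formula is not given in this summary, so the product form, the choice of the L2 column norm, and the function names `weight_aware_mask` and `sparse_linear` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def weight_aware_mask(x, weight_col_norms, sparsity):
    """Return a boolean keep-mask over input channels.

    Assumed score: |x_j| * ||W[:, j]||_2. Passing a vector of ones as
    weight_col_norms reduces this to activation-only magnitude selection.
    """
    scores = np.abs(x) * weight_col_norms
    n_keep = x.size - int(round(sparsity * x.size))
    keep = np.argsort(-scores, kind="stable")[:n_keep]
    mask = np.zeros(x.size, dtype=bool)
    mask[keep] = True
    return mask

def sparse_linear(x, W, mask):
    """Compute W @ x using only the kept input channels (W is out x in)."""
    return W[:, mask] @ x[mask]

# Toy layer: channel 2 has a small activation but large weights.
W = np.ones((8, 6))
W[:, 2] = 20.0
x = np.array([1.0, 0.9, 0.1, 0.8, 0.7, 0.6])

col_norms = np.linalg.norm(W, axis=0)  # computed once, offline
act_only = weight_aware_mask(x, np.ones_like(x), sparsity=0.5)
weight_aware = weight_aware_mask(x, col_norms, sparsity=0.5)

y_dense = W @ x
err_act = np.linalg.norm(y_dense - sparse_linear(x, W, act_only))
err_wa = np.linalg.norm(y_dense - sparse_linear(x, W, weight_aware))
print("activation-only keeps channel 2:", bool(act_only[2]))
print("weight-aware keeps channel 2:  ", bool(weight_aware[2]))
print("output error (activation-only vs weight-aware):", err_act, err_wa)
```

In this toy setup the activation-only mask drops channel 2 because its activation is small, while the weight-aware mask keeps it and yields a smaller output error at the same 50% sparsity. Because the weight norms depend only on the weights, they can be computed ahead of time, which is consistent with the "precomputed weight norms" mentioned in the abstract.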

Paper Structure (22 sections, 8 equations, 6 figures, 2 tables, 4 algorithms)

Figures (6)

  • Figure 1: The overall framework of WiSparse. The process starts by calculating importance scores for each layer based on activation values and weight norms. These scores generate sparsity masks to prune less important channels. Block-level sparsity is optimized via an evolutionary search on a small calibration dataset, followed by refinement of layer-level sparsities using a greedy allocation strategy. The final configuration is applied during inference to improve computational efficiency by reducing unnecessary operations.
  • Figure 2: Distribution of activation and weight magnitudes for the self_attn.o_proj layer in block 17 of Llama-3.1-8B. The plot shows that channels with low activation magnitudes can have high-magnitude weights (e.g., channel 2244), demonstrating the limitations of using activation-only metrics to assess channel importance.
  • Figure 3: Block-wise sensitivity to sparsification. The plot reports the relative change in validation perplexity ($\Delta$PPL vs the dense model, in %) when sparsifying one block at a time while keeping all other blocks dense. Curves correspond to 40%, 50%, and 60% sparsity.
  • Figure 4: Achieved TFLOPS (left) and end-to-end inference speed in tokens/s (right) under different sparsity levels with WiSparse on Llama-3.1-8B, Mistral-7B, and Qwen-2.5-7B.
  • Figure 5: Per-block and per-module (self-attention and MLP) sparsity distributions for (a) Llama-3.1-8B and (b) Qwen-2.5-7B, as determined by our search algorithm targeting 50% overall sparsity.
  • ...and 1 more figure
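The Figure 1 caption mentions refining layer-level sparsities within a block using a greedy allocation strategy. The sketch below shows one way such a refinement could work, under stated assumptions: all layers in the block have equal size, the block budget is the mean sparsity across its layers, and a hypothetical callable `layer_error(i, s)` returns the reconstruction error of layer `i` at sparsity `s`. The function name, step size, and sparsity cap are illustrative choices and are not taken from the paper.

```python
from typing import Callable, List

def greedy_layer_sparsity(
    n_layers: int,
    block_sparsity: float,
    layer_error: Callable[[int, float], float],
    step: float = 0.05,
    max_sparsity: float = 0.9,
) -> List[float]:
    """Distribute a block-level sparsity budget across layers greedily.

    Starting from fully dense layers, repeatedly raise by one step the
    sparsity of the layer whose error grows the least, until the mean
    sparsity of the block reaches block_sparsity.
    """
    total_steps = round(n_layers * block_sparsity / step)
    max_steps = round(max_sparsity / step)
    levels = [0] * n_layers  # layer i has sparsity levels[i] * step
    for _ in range(total_steps):
        best, best_cost = None, float("inf")
        for i in range(n_layers):
            if levels[i] >= max_steps:
                continue  # this layer has reached the cap
            cost = (layer_error(i, (levels[i] + 1) * step)
                    - layer_error(i, levels[i] * step))
            if cost < best_cost:
                best, best_cost = i, cost
        if best is None:
            break  # every layer is capped; budget cannot be met
        levels[best] += 1
    return [lvl * step for lvl in levels]

# Toy error model (an assumption): error grows quadratically with
# sparsity, scaled by a per-layer sensitivity.
sensitivity = [1.0, 4.0, 0.5, 2.0]

def toy_error(i: int, s: float) -> float:
    return sensitivity[i] * s * s

alloc = greedy_layer_sparsity(4, 0.5, toy_error)
print([round(a, 2) for a in alloc], "mean:", round(sum(alloc) / 4, 2))
```

With this toy error model, the most sensitive layer (index 1) receives the lowest sparsity and the least sensitive layer (index 2) the highest, while the block's mean sparsity stays at the 50% target. In practice the error callable would be measured on calibration data rather than given in closed form.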