DuoGPT: Training-free Dual Sparsity through Activation-aware Pruning in LLMs

Ruokai Yin, Yuhang Li, Donghyun Lee, Priyadarshini Panda

TL;DR

DuoGPT tackles the high memory and compute costs of LLMs with a training-free dual-sparsity approach that couples unstructured weight pruning with runtime activation sparsity. It reinterprets activation sparsity as dynamic structured weight sparsity and extends the OBC framework with activation-aware calibration and residual corrections, producing dual-sparse (spMspV) workloads. A key contribution is an efficient, GPU-friendly pruning calibration algorithm with Hessian synchronization and a closed-form update. On LLaMA-2 and LLaMA-3, DuoGPT outperforms state-of-the-art structured pruning methods by up to 9.17% accuracy at an iso-speedup of 1.39× over the dense baseline, and it calibrates 70B models on a single 80GB A100 in about 2 hours. The method scales to billion-parameter LLMs and complements existing activation-pruning techniques, offering a practical path toward faster, more memory-efficient LLM deployment, while leaving GPU kernel development for dual-sparse workloads as future work.

Abstract

Large language models (LLMs) deliver strong performance but are difficult to deploy due to high memory and compute costs. While pruning reduces these demands, most methods ignore activation sparsity observed at runtime. We reinterpret activation sparsity as dynamic structured weight sparsity and propose DuoGPT, a unified framework that constructs dual-sparse (spMspV) workloads by combining unstructured weight pruning with activation sparsity. To preserve accuracy, we extend the Optimal Brain Compression (OBC) framework with activation-aware calibration and introduce output residuals from the dense model as correction terms. We further optimize the solution for efficient GPU execution, enabling scalability to billion-parameter LLMs. Evaluations on LLaMA-2 and LLaMA-3 show that DuoGPT outperforms state-of-the-art structured pruning methods by up to 9.17% accuracy at an iso-speedup of 1.39$\times$ compared to the baseline dense model. Code is available on GitHub.
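The reinterpretation of activation sparsity as dynamic structured weight sparsity can be illustrated with a small NumPy sketch of a single decoding-stage GEMV. This is not the authors' implementation: the function name `dual_sparse_gemv`, the toy shapes, the random 50% weight mask, and the median-magnitude activation threshold are all assumptions made for illustration. The point is only that a zero activation makes the corresponding weight column irrelevant, so skipping those columns gives the same output as the masked dense product.

```python
import numpy as np

def dual_sparse_gemv(W, w_mask, x):
    """Compute (W * w_mask) @ x while touching only columns where x != 0.

    Hypothetical helper for illustration; not part of the DuoGPT code base.
    """
    active = np.flatnonzero(x)                    # indices of nonzero activations
    W_active = W[:, active] * w_mask[:, active]   # only these weight columns are needed
    y = W_active @ x[active]
    weights_touched = int(w_mask[:, active].sum())  # surviving weights in active columns
    return y, weights_touched

rng = np.random.default_rng(0)
d_out, d_in = 8, 16                               # toy layer size (assumption)
W = rng.standard_normal((d_out, d_in))
w_mask = rng.random((d_out, d_in)) >= 0.5         # random unstructured weight mask (assumption)

x = rng.standard_normal(d_in)
x[np.abs(x) < np.median(np.abs(x))] = 0.0         # zero the smaller-magnitude half (assumption)

y_sparse, touched = dual_sparse_gemv(W, w_mask, x)
y_ref = (W * w_mask) @ x                          # masked dense reference
```

Because the set of active columns depends on the input `x`, the column-level (structured) sparsity changes from token to token, while `w_mask` stays fixed after pruning. A real dual-sparse kernel would avoid materializing `W_active`; the sketch only checks the arithmetic equivalence.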

Paper Structure

This paper contains 29 sections, 1 theorem, 33 equations, 6 figures, 11 tables, 1 algorithm.

Key Result

Theorem 1

Under activation sparsity level $\mathtt{p^x}$, let $\mathcal{L}_\text{SparseGPT}$ and $\mathcal{L}_\text{DuoGPT}$ denote the calibration losses of the two methods. DuoGPT achieves the following guaranteed loss improvement over SparseGPT: where $\alpha>0$ is the stability constant measuring spectral gap preservation, $\lambda_\text{max}$ is the maximum eigenvalue of the Hessian

Figures (6)

  • Figure 1: GEMV operation for the single-batch decoding stage under different types of sparsity.
  • Figure 2: (a) Illustration of how dual-sparsity accelerates the decoding stage of LLMs by saving computation, memory loading, and storage. (b) Computing paradigm of the DuoGPT's efficient GPU implementation. We neglect the element-wise division of $\mathrm{diag}(\mathbf{L})$ for $\mathbf{c}$ in the figure.
  • Figure 3: Number of weights of LLaMA-2-7B loaded to SRAM. SWS=structured weight sparsity, AS=activation sparsity, and DS=dual-sparsity.
  • Figure 4: Mean zero-shot accuracy and perplexity on LLaMA-3-8B, LLaMA-2-7B, and LLaMA-2-13B. The results are reported across different dual-sparsity levels. The perplexity is reported for WikiText2 dataset and the accuracy results are averaged across 7 tasks.
  • Figure 5: The effect of the C4 calibration set size and sequence length on PPL of WikiText2 dataset for LLaMA-2-7B.
  • ...and 1 more figure

Theorems & Definitions (1)

  • Theorem 1