UniPruning: Unifying Local Metric and Global Feedback for Scalable Sparse LLMs

Yizhuo Ding, Wanying Qu, Jiawei Geng, Wenqi Shao, Yanwei Fu

TL;DR

UniPruning addresses the trade-off between fast, local pruning and globally coordinated sparsity by integrating a layer-wise saliency signal with a model-wide budget through mirror descent, without updating weights. It supports both unstructured and N:M pruning, enabling one-shot mask extraction after calibration. Empirical results across multiple LLM families show competitive or superior perplexity and zero-shot accuracy, with ablations confirming the necessity of the local-global coupling and the mirror-descent mechanism. The approach demonstrates robustness across architectures and hardware backends, offering a scalable path for sparse LLM deployment.

Abstract

Large Language Models (LLMs) achieve strong performance across diverse tasks but face prohibitive computational and memory costs. Pruning offers a promising path by inducing sparsity while preserving architectural flexibility. However, existing methods struggle to balance efficiency and robustness: local metric approaches prune layer by layer but often collapse under high sparsity, whereas global feedback methods enforce consistency at the cost of expensive weight updates or restrictive semi-structured formats. We present UniPruning, a unified post-training pruning framework that combines the speed of local saliency metrics with the stability of global coordination, enabled by a mirror-descent-based optimization, all without updating model weights. UniPruning leverages fast layer-wise scoring and a lightweight global controller to allocate a single sparsity budget, supporting both unstructured and semi-structured N:M pruning within one framework. After a brief calibration, it can generate pruning masks for arbitrary sparsity levels in one shot, and adapts seamlessly to hardware-aware constraints. Extensive experiments on multiple pretrained LLM families and standard benchmarks show that UniPruning consistently delivers competitive or superior perplexity and zero-shot accuracy. Ablation studies further highlight the importance of mirror descent and local saliency anchoring. Overall, UniPruning provides an efficient, principled, and scalable solution for sparsifying large-scale LLMs. Our code is available at: https://github.com/RainbowQTT/UniPruning.
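
To make the idea of a single, model-wide sparsity budget concrete, the following is a minimal sketch, not the authors' implementation. It assumes per-layer saliency scores are already available (here they are random toy values), and the function name `global_masks` and the layer names are hypothetical. One shared threshold is applied to all layers, so each layer's sparsity is determined by its scores, and the same scores can be reused to extract masks at several target sparsities.

```python
import numpy as np

def global_masks(scores, sparsity):
    """Return per-layer boolean keep-masks that prune `sparsity` of all entries.

    `scores` maps a (hypothetical) layer name to a saliency array; higher
    means more important. One threshold is shared by every layer, so the
    per-layer sparsity follows from the scores rather than being fixed up front.
    """
    flat = np.concatenate([s.ravel() for s in scores.values()])
    k = int(sparsity * flat.size)  # number of entries to prune model-wide
    if k == 0:
        return {name: np.ones(s.shape, dtype=bool) for name, s in scores.items()}
    # k-th smallest score; entries at or below it are pruned
    # (tied scores at the threshold would be pruned together)
    threshold = np.partition(flat, k - 1)[k - 1]
    return {name: s > threshold for name, s in scores.items()}

rng = np.random.default_rng(0)
scores = {
    "layer_a": rng.random((4, 8)),        # toy saliency values
    "layer_b": rng.random((4, 8)) * 0.5,  # a layer with systematically lower saliency
}

# The same scores yield masks for several sparsity targets without recomputation.
for target in (0.5, 0.7):
    masks = global_masks(scores, target)
    per_layer = {name: round(1.0 - m.mean(), 2) for name, m in masks.items()}
    print(target, per_layer)
```

In this sketch the global threshold plays the role of the budget allocator only in the simplest possible way; the paper's controller is based on mirror descent and is not reproduced here.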

Paper Structure

This paper contains 27 sections, 2 theorems, 37 equations, 3 figures, 8 tables, 1 algorithm.

Key Result

Theorem 1

Under the above assumptions, if the step size $\alpha$ satisfies the paper's step-size condition (not reproduced in this summary), then the sequence $\{(W^k, \Gamma^k)\}$ generated by the updates in Eq. (eq:update-gamma) converges to a critical point of the energy in Eq. (eq:energy).

Figures (3)

  • Figure 1: Overall framework of Unified Pruning. The framework targets pruning in two types of layers: MLP layers and attention projection layers. It operates in two stages. (a) Search Stage: model weights $W$ are iteratively updated while saliency variables $\Gamma$ are jointly optimized with local metrics $S(W)$ via mirror descent, gradually accumulating pruning signals. (b) Pruning Stage: the final $\Gamma^N$ is projected into unstructured or semi-structured sparsity masks, which are applied to the original pretrained weights $W^0$ to yield sparse models at arbitrary sparsity levels.
  • Figure 2: WikiText perplexity comparison at 70% sparsity.
  • Figure 3: WikiText perplexity of different models at 60% sparsity across $\lambda$ values.
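
The Pruning Stage described in the Figure 1 caption projects a saliency tensor into a mask and applies it to the untouched pretrained weights. The sketch below illustrates the semi-structured case under stated assumptions: it is not the authors' code, the function name `nm_mask` is hypothetical, the saliency and weight values are toy data, and the N:M convention assumed is "keep the N highest-scoring entries in every group of M consecutive entries along the last axis" (for example 2:4).

```python
import numpy as np

def nm_mask(scores, n=2, m=4):
    """Keep the n highest-scoring entries in every group of m along the last axis."""
    rows, cols = scores.shape
    if cols % m != 0:
        raise ValueError("last dimension must be divisible by m")
    groups = scores.reshape(rows, cols // m, m)
    # indices of the n largest scores within each group of m
    top = np.argsort(groups, axis=-1)[..., -n:]
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, top, True, axis=-1)
    return mask.reshape(rows, cols)

# Toy stand-ins for the final saliency variables and the pretrained weights.
gamma = np.array([[0.9, 0.1, 0.4, 0.3, 0.2, 0.8, 0.7, 0.05]])
w0 = np.arange(1.0, 9.0).reshape(1, 8)

mask = nm_mask(gamma, n=2, m=4)
sparse_w = w0 * mask  # surviving weights keep their original values
print(mask.astype(int))
print(sparse_w)
```

Because the mask depends only on the saliency tensor, the original weights are never modified beyond being zeroed where the mask is false, which matches the caption's description of applying masks to $W^0$.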

Theorems & Definitions (8)

  • Theorem 1: Global convergence
  • proof
  • Lemma 2
  • proof
  • proof
  • Definition 3
  • Definition 4
  • Definition 5