UniPruning: Unifying Local Metric and Global Feedback for Scalable Sparse LLMs
Yizhuo Ding, Wanying Qu, Jiawei Geng, Wenqi Shao, Yanwei Fu
TL;DR
UniPruning addresses the trade-off between fast, local pruning and globally coordinated sparsity by integrating a layer-wise saliency signal with a model-wide budget through mirror descent, without updating weights. It supports both unstructured and N:M pruning, enabling one-shot mask extraction after calibration. Empirical results across multiple LLM families show competitive or superior perplexity and zero-shot accuracy, with ablations confirming the necessity of the local-global coupling and the mirror-descent mechanism. The approach demonstrates robustness across architectures and hardware backends, offering a scalable path for sparse LLM deployment.
Abstract
Large Language Models (LLMs) achieve strong performance across diverse tasks but face prohibitive computational and memory costs. Pruning offers a promising path by inducing sparsity while preserving architectural flexibility. However, existing methods struggle to balance efficiency and robustness: local metric approaches prune layer by layer but often collapse under high sparsity, whereas global feedback methods enforce consistency at the cost of expensive weight updates or restrictive semi-structured formats. We present UniPruning, a unified post-training pruning framework that combines the speed of local saliency metrics with the stability of global coordination, enabled by a mirror-descent-based optimization, all without updating model weights. UniPruning leverages fast layer-wise scoring and a lightweight global controller to allocate a single sparsity budget, supporting both unstructured and semi-structured N:M pruning within one framework. After a brief calibration, it can generate pruning masks for arbitrary sparsity levels in one shot, and adapts seamlessly to hardware-aware constraints. Extensive experiments on multiple pretrained LLM families and standard benchmarks show that UniPruning consistently delivers competitive or superior perplexity and zero-shot accuracy. Ablation studies further highlight the importance of mirror descent and local saliency anchoring. Overall, UniPruning provides an efficient, principled, and scalable solution for sparsifying large-scale LLMs. Our code is available at: https://github.com/RainbowQTT/UniPruning.
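To make the two mask formats mentioned in the abstract concrete, the following is a minimal illustrative sketch, not the UniPruning implementation. It assumes a flat list of per-weight importance scores is already available (how UniPruning computes them is not shown here), and the function names `unstructured_mask` and `n_m_mask` are hypothetical. Once scores exist, a mask for any sparsity level can be read off by ranking alone, and an N:M mask differs only in that the ranking is done inside each group of M consecutive weights.

```python
from typing import List


def unstructured_mask(scores: List[float], sparsity: float) -> List[int]:
    """Prune the lowest-scoring `sparsity` fraction of all entries (1 = keep)."""
    n_prune = int(len(scores) * sparsity)
    # Indices in ascending score order; the first n_prune are pruned.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    mask = [1] * len(scores)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


def n_m_mask(scores: List[float], n: int, m: int) -> List[int]:
    """Keep the n highest-scoring entries in every consecutive group of m."""
    if len(scores) % m != 0:
        raise ValueError("number of scores must be a multiple of m")
    mask = [0] * len(scores)
    for start in range(0, len(scores), m):
        group = range(start, start + m)
        keep = sorted(group, key=lambda i: scores[i], reverse=True)[:n]
        for i in keep:
            mask[i] = 1
    return mask


# Hypothetical scores for eight weights (illustrative values only).
scores = [0.9, 0.8, 0.7, 0.6, 0.1, 0.2, 0.3, 0.4]

mask_50 = unstructured_mask(scores, 0.50)   # [1, 1, 1, 1, 0, 0, 0, 0]
mask_75 = unstructured_mask(scores, 0.75)   # [1, 1, 0, 0, 0, 0, 0, 0]
mask_2_4 = n_m_mask(scores, n=2, m=4)       # [1, 1, 0, 0, 0, 0, 1, 1]
```

Both 50% masks keep four weights, but the 2:4 mask is forced to keep two weights in the low-scoring second group. This is the sense in which a semi-structured format is more restrictive than an unstructured one at the same sparsity.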
