Týr-the-Pruner: Structural Pruning LLMs via Global Sparsity Distribution Optimization
Guanchen Li, Yixing Xu, Zeping Li, Ji Liu, Xuanwu Yin, Dong Li, Emad Barsoum
TL;DR
This work addresses end-to-end global structural pruning for large language models. Týr-the-Pruner constructs a multi-sparsity supernet through local pruning, then uses evolutionary search to find the layerwise sparsity distribution that performs best under a target overall sparsity. Its main components are Taylor/Hessian-based local pruning with weight adjustment, an expectation error accumulation scheme that balances the candidate sparse structures, and a distillation-inspired objective that guides sparsity selection within a coarse-to-fine iterative framework. Empirically, the method achieves state-of-the-art structural pruning across LLMs (for example, retaining 97% of the dense model's performance at 50% sparsity on Llama-3.1-70B) and remains compatible with quantization and unstructured sparsity. Storing the supernet on disk reduces memory demands, and the pruned models preserve both generative and downstream task capabilities.
Abstract
Structural pruning enhances hardware-agnostic inference efficiency for large language models (LLMs) yet often fails to maintain comparable performance. Local pruning performs efficient layer-by-layer compression but ignores global topology. Although global pruning aims to identify an optimal sparse model, intuitive methods typically adopt a two-stage paradigm that first evaluates substructure saliency and then applies global pruning, which ignores inter-structure dependencies and fails to achieve end-to-end optimization. To address these limitations, we propose Týr-the-Pruner, an efficient end-to-end search-based global structural pruning framework. This framework constructs a supernet by repeatedly applying local pruning across a range of sparsity ratios to each layer in an LLM, with the core goal of determining the optimal sparsity distribution under a target overall sparsity ratio. Concretely, we introduce an effective local pruning method and an expectation error accumulation approach to improve supernet construction. Furthermore, we employ an iterative prune-and-search strategy with coarse-to-fine sparsity granularity to ensure efficient search convergence. Experimental results show that Týr-the-Pruner achieves state-of-the-art structural pruning, retaining 97% of the dense model's performance while removing a challenging 50% of Llama-3.1-70B's parameters. Code will be available at https://github.com/AMD-AGI/Tyr-the-Pruner.
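
The search step described above can be illustrated with a toy sketch. This is not the released implementation: the function names, the additive per-layer `error_table`, and the sensitivity values are all hypothetical stand-ins. In particular, the framework scores candidates with a distillation-inspired objective on the supernet, whereas this sketch assumes each layer's error is known in advance and simply sums them. What it does show is the constraint handling: every mutation moves one sparsity step from one layer to another, so the overall sparsity stays at the target throughout the search.

```python
import random


def mutate(levels, num_levels, rng):
    """Raise one layer's sparsity level and lower another's by one step.

    The sum of levels is unchanged, so the overall sparsity stays on target.
    """
    child = list(levels)
    can_go_up = [i for i, v in enumerate(child) if v < num_levels - 1]
    can_go_down = [i for i, v in enumerate(child) if v > 0]
    if not can_go_up or not can_go_down:
        return child
    i = rng.choice(can_go_up)
    j = rng.choice(can_go_down)
    if i != j:
        child[i] += 1
        child[j] -= 1
    return child


def evolutionary_search(error_table, start_level, generations=50,
                        population=16, survivors=4, seed=0):
    """Toy evolutionary search over layerwise sparsity levels.

    error_table[l][k] is an ASSUMED proxy error for layer l at level k.
    """
    rng = random.Random(seed)
    num_layers = len(error_table)
    num_levels = len(error_table[0])

    def fitness(levels):
        # Hypothetical additive objective: sum of per-layer proxy errors.
        return sum(error_table[l][k] for l, k in enumerate(levels))

    # Start from the uniform distribution at the target sparsity.
    parents = [[start_level] * num_layers]
    for _ in range(generations):
        children = [mutate(rng.choice(parents), num_levels, rng)
                    for _ in range(population)]
        pool = sorted(parents + children, key=fitness)
        unique = []
        for cand in pool:
            if cand not in unique:
                unique.append(cand)
        parents = unique[:survivors]  # elitist selection keeps the best so far
    return parents[0], fitness(parents[0])


# Levels 0..8 stand for sparsity k/8; level 4 is the 50% target.
sensitivity = [4.0, 1.0, 1.0, 0.25]  # made-up per-layer sensitivities
error_table = [[s * (k / 8) ** 2 for k in range(9)] for s in sensitivity]
best, best_err = evolutionary_search(error_table, start_level=4)
print("levels:", best, "proxy error:", round(best_err, 4))
```

Because the parents are carried into each selection pool, the best proxy error never increases from one generation to the next. A coarse-to-fine variant would rerun the search with a finer grid of levels centred on the previous result.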
