Týr-the-Pruner: Structural Pruning LLMs via Global Sparsity Distribution Optimization

Guanchen Li, Yixing Xu, Zeping Li, Ji Liu, Xuanwu Yin, Dong Li, Emad Barsoum

TL;DR

This work addresses end-to-end global structural pruning for large language models. Týr-the-Pruner constructs a multi-sparsity supernet by locally pruning each layer at several sparsity ratios, then uses evolutionary search to find the layerwise sparsity distribution that best meets a target overall sparsity. Key components are Taylor/Hessian-based local pruning with weight adjustment, an expectation error accumulation scheme that accounts for the multiple sparse structures in the supernet, and a distillation-inspired objective that guides sparsity selection within a coarse-to-fine iterative framework. The reported results are state of the art for structural pruning across LLMs (e.g., 97% of the dense model's performance retained at 50% sparsity on Llama-3.1-70B), and the method is compatible with quantization and unstructured sparsity. Storing the supernet on disk reduces memory demands, and the pruned models preserve generative and downstream capabilities.

Abstract

Structural pruning enhances hardware-agnostic inference efficiency for large language models (LLMs) yet often fails to maintain comparable performance. Local pruning performs efficient layer-by-layer compression but ignores global topology. Although global pruning aims to identify an optimal sparse model, intuitive methods typically adopt a two-stage paradigm that first evaluates substructure saliency and then applies global pruning, which ignores inter-structure dependencies and fails to achieve end-to-end optimization. To address these limitations, we propose Týr-the-Pruner, an efficient end-to-end search-based global structural pruning framework. This framework constructs a supernet by repeatedly applying local pruning across a range of sparsity ratios to each layer in an LLM, with the core goal of determining the optimal sparsity distribution under a target overall sparsity ratio. Concretely, we introduce an effective local pruning method and an expectation error accumulation approach to improve supernet construction. Furthermore, we employ an iterative prune-and-search strategy with coarse-to-fine sparsity granularity to ensure efficient search convergence. Experimental results show that Týr-the-Pruner achieves state-of-the-art structural pruning, retaining 97% of the dense model's performance while removing a challenging 50% of Llama-3.1-70B's parameters. Code will be available at https://github.com/AMD-AGI/Tyr-the-Pruner.
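The search for a sparsity distribution under a fixed overall sparsity can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the quadratic `toy_error` objective, the per-layer `sensitivity` weights, and the step sizes are all hypothetical, and the sketch assumes equally sized layers so that the overall sparsity equals the mean of the layer sparsities. It shows only two ideas from the abstract: a mutation that shifts sparsity between layers without changing the total, and a coarse-to-fine sequence of shift sizes.

```python
import random

# Sparsity is stored as integer thousandths (500 == 50%) so the total
# stays exactly constant under shifts, with no floating-point drift.

def toy_error(levels, sensitivity):
    """Hypothetical stand-in for a real model-quality objective:
    each layer's error grows quadratically with its sparsity."""
    return sum(w * (s / 1000) ** 2 for w, s in zip(sensitivity, levels))

def shift_sparsity(levels, step, rng):
    """Move `step` units of sparsity from one layer to another.
    The sum of all levels (hence the overall sparsity) is unchanged."""
    child = list(levels)
    donors = [i for i, s in enumerate(child) if s - step >= 0]
    if not donors:
        return child
    i = rng.choice(donors)
    receivers = [j for j, s in enumerate(child) if j != i and s + step <= 1000]
    if not receivers:
        return child
    j = rng.choice(receivers)
    child[i] -= step
    child[j] += step
    return child

def search_sparsity_distribution(num_layers, target, sensitivity,
                                 steps=(125, 25, 5), generations=50,
                                 offspring=16, seed=0):
    """Simple evolutionary search: start from a uniform distribution and
    refine it with progressively smaller sparsity shifts (coarse to fine)."""
    rng = random.Random(seed)
    best = [target] * num_layers
    best_err = toy_error(best, sensitivity)
    for step in steps:
        for _ in range(generations):
            children = [shift_sparsity(best, step, rng) for _ in range(offspring)]
            cand = min(children, key=lambda c: toy_error(c, sensitivity))
            cand_err = toy_error(cand, sensitivity)
            if cand_err < best_err:  # keep the parent unless a child improves on it
                best, best_err = cand, cand_err
    return best, best_err

if __name__ == "__main__":
    sens = [1.0, 2.0, 4.0, 8.0]  # hypothetical: later layers are more sensitive
    dist, err = search_sparsity_distribution(4, 500, sens)
    print("layer sparsities (thousandths):", dist)
    print("overall sparsity:", sum(dist) / len(dist) / 1000)
    print("toy error:", err)
```

In this toy setting the search moves sparsity away from the more sensitive layers while the mean stays at the 50% target. In the method described above, each candidate would instead be assembled from pre-pruned layers in the supernet and scored on the actual model.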

Paper Structure

This paper contains 24 sections, 10 equations, 7 figures, 19 tables, 4 algorithms.

Figures (7)

  • Figure 1: An overview of Týr-the-Pruner. A large language model (a) is locally pruned at multiple sparsity ratios and assembled into a supernet (b). An iterative prune-and-search strategy then selects the optimal sparse structure for each layer while maintaining a target overall sparsity ratio: pruning and sparsity-shift-driven evolutionary search alternate with a coarse-to-fine sparsity interval granularity (c). The result is the pruned LLM with the optimal sparsity distribution (d).
  • Figure 2: Layerwise error accumulation yields a more accurate pruning result than pruning without it. Solid lines indicate forward propagation, and dashed lines indicate pruning.
  • Figure 3: Týr-the-Pruner has faster convergence, fewer exploration generations, shorter search time, and better search outcomes compared to the fine-grained search-only approach.
  • Figure 4: Pre- and post-pruning large language model inference benchmarks.
  • Figure 5: Sparsity distribution of Týr-the-Pruner and the search-only strategy on Llama-3.1-8B.
  • ...and 2 more figures