Pruning as a Cooperative Game: Surrogate-Assisted Layer Contribution Estimation for Large Language Models

Xuan Ding, Pengyu Tong, Ranjie Duan, Yunjian Zhang, Rui Sun, Yao Zhu

TL;DR

Pruning as a Cooperative Game reframes layer pruning for large language models as a cooperative game among Transformer layers, with model performance as the utility. To overcome the intractability of exact Shapley values, the authors introduce a two-stage approximation: stratified Monte Carlo mask sampling to gather supervision signals and a lightweight surrogate network to predict performance drops for unseen masks, enabling efficient estimation of marginal contributions. This approach preserves inter-layer dependencies, adaptively identifies critical layers, and demonstrates superior perplexity and zero-shot accuracy across Transformer and non-Transformer backbones, while remaining compatible with post-training quantization. The work offers a scalable, principled framework for practical deployment of compressed LLMs with favorable efficiency-accuracy trade-offs.
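
For reference, the quantity that the two-stage approximation targets is the standard Shapley value. The notation here is an assumption chosen for illustration rather than taken from the paper: $N$ is the set of $n$ layers, $v(S)$ is the model performance when only the layers in $S$ are kept, and $\phi_i$ is the contribution of layer $i$.

$$
\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n-|S|-1)!}{n!}\,\bigl[v(S \cup \{i\}) - v(S)\bigr]
$$

Grouping the coalitions by size $k = |S|$ gives an equivalent form that makes the motivation for stratifying by Hamming weight visible: each size contributes the average marginal gain of layer $i$ over coalitions of that size, and all sizes are weighted equally.

$$
\phi_i(v) = \frac{1}{n} \sum_{k=0}^{n-1} \binom{n-1}{k}^{-1} \sum_{\substack{S \subseteq N \setminus \{i\} \\ |S| = k}} \bigl[v(S \cup \{i\}) - v(S)\bigr]
$$

The exact sum ranges over $2^{n-1}$ coalitions per layer, which is why sampling and a cheap predictor of $v$ are needed.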

Abstract

While large language models (LLMs) demonstrate impressive performance across various tasks, their deployment in real-world scenarios is still constrained by high computational demands. Layer-wise pruning, a commonly employed strategy to mitigate inference costs, can partially address this challenge. However, existing approaches generally depend on static heuristic rules and fail to account for the interdependencies among layers, which limits the effectiveness of the pruning process. To this end, this paper proposes a game-theoretic framework that formulates layer pruning as a cooperative game in which each layer acts as a player and model performance serves as the utility. Because computing exact Shapley values is computationally infeasible for LLMs, we propose using a lightweight surrogate network to estimate layer-wise marginal contributions. This network can predict LLM performance for arbitrary layer combinations at a low computational cost. Additionally, we employ stratified Monte Carlo mask sampling to further reduce the cost of Shapley value estimation. This approach captures inter-layer dependencies and dynamically identifies critical layers for pruning. Extensive experiments demonstrate the consistent superiority of our method in terms of perplexity and zero-shot accuracy, achieving more efficient and effective layer-wise pruning for large language models.
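
The estimation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_utility` is a hypothetical stand-in for the surrogate's performance prediction, `stratified_shapley` and its parameters are names invented for this example, and the sampling scheme (uniform sampling within each coalition size) is an assumption about what "stratified Monte Carlo mask sampling" could look like.

```python
import random


def toy_utility(mask):
    """Hypothetical stand-in for the surrogate: score of the kept layers.

    mask[i] is True when layer i is kept. Layers 0 and 3 earn a bonus
    only when both are present, which mimics an inter-layer dependency.
    """
    weights = [3.0, 1.0, 0.5, 2.0]
    score = sum(w for w, keep in zip(weights, mask) if keep)
    if mask[0] and mask[3]:
        score += 1.0
    return score


def stratified_shapley(utility, n_layers, samples_per_stratum, seed=0):
    """Estimate each layer's Shapley value, stratifying by coalition size."""
    rng = random.Random(seed)
    phi = [0.0] * n_layers
    for i in range(n_layers):
        others = [j for j in range(n_layers) if j != i]
        total = 0.0
        # One stratum per coalition size k (the Hamming weight without layer i).
        for k in range(n_layers):
            acc = 0.0
            for _ in range(samples_per_stratum):
                mask = [False] * n_layers
                for j in rng.sample(others, k):
                    mask[j] = True
                without_i = utility(mask)
                mask[i] = True
                acc += utility(mask) - without_i  # marginal gain of layer i
            total += acc / samples_per_stratum
        phi[i] = total / n_layers  # every size gets the same weight 1/n
    return phi


phi = stratified_shapley(toy_utility, n_layers=4, samples_per_stratum=200)
# Layers with the smallest estimated contribution are pruned first.
pruning_order = sorted(range(4), key=lambda i: phi[i])
print(phi, pruning_order)
```

For this toy utility the exact Shapley values are 3.5, 1.0, 0.5, and 2.5, because the bonus of 1.0 is split evenly between layers 0 and 3. A per-layer heuristic that scores layer 3 in isolation would see only its weight of 2.0 and miss the shared bonus, which is the kind of dependency the cooperative-game view is meant to capture.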

Paper Structure

This paper contains 42 sections, 5 equations, 7 figures, 26 tables, and 2 algorithms.

Figures (7)

  • Figure 1: Layer importance is context-dependent under pruning. Both plots rank layer importance by the change in PPL before and after pruning on the BookCorpus dataset. (Left) Bar plot of the $\Delta$Rank changes under random single-layer pruning, showing that some layers' importance shifts more dramatically when others are pruned. (Right) Line plot of the $\Delta$Rank changes during multi-layer pruning, where the lowest-ranked layer is removed at each step. The five most volatile layers are highlighted in darker colors, reflecting their fluctuating importance.
  • Figure 2: Framework of our method. Both stages use stratified Monte Carlo masks with controlled Hamming weight. Stage one uses calibration data to compute PPL-based scores for training a lightweight surrogate network, and stage two uses the surrogate to efficiently compute Shapley-based layer importance for scalable LLM pruning.
  • Figure 3: Adversarial reasoning accuracy on ANLI. (a) Comparison of pruned models on R1–R3 and their average; only the top two values at each x position are labeled. (b) Accuracy across parameter scales (3–12 layers, 6.1B–4.3B); only the average is labeled.
  • Figure 4: Comparison with structured width pruning methods (Wanda-sp, FLAP, LLM-Pruner) on PTB and system efficiency metrics across progressively reduced parameter budgets. Our method consistently achieves the best trade-off between perplexity, latency, throughput, and GPU memory.
  • Figure 5: Evaluation before and after quantization across model sizes: (a) PPL (left / right bars), (b) Latency vs Throughput (circles / squares), and (c) GPU Memory (solid / dashed lines).
  • ...and 2 more figures