E$^3$-Pruner: Towards Efficient, Economical, and Effective Layer Pruning for Large Language Models

Tao Yuan, Haoli Bai, Yinfei Pan, Xuyang Cao, Tianyu Zhang, Lu Hou, Ting Hu, Xianzhi Yu

TL;DR

E3-Pruner targets three practical deployment gaps in layer pruning for large language models: accuracy loss, training cost, and limited speedup. It is a two-stage framework: a differentiable Gumbel-TopK mask search first selects which layers to prune, and an adaptive, token-aware knowledge distillation stage then restores task performance. Experiments across diverse models report higher accuracy and better speedups than state-of-the-art baselines, including under heavy pruning on reasoning benchmarks. For example, pruning 25% of the layers of Qwen3-32B costs 0.8 points on MATH-500 while giving a 1.33$\times$ inference speedup with only 0.5B training tokens.

Abstract

As large language models grow in size, layer pruning has gained attention as a hardware-friendly approach to model compression. However, existing layer pruning methods struggle to address three practical deployment challenges at once: performance degradation, high training costs, and limited acceleration. To overcome these limitations, we propose E$^3$-Pruner, a task-Effective, training-Economical, and inference-Efficient layer pruning framework. E$^3$-Pruner introduces two key innovations: (1) a differentiable mask optimization method using a Gumbel-TopK sampler, enabling efficient and precise pruning mask search; and (2) an entropy-aware adaptive knowledge distillation strategy that enhances task performance. Extensive experiments over diverse model architectures and benchmarks demonstrate the superiority of our method over state-of-the-art approaches. Notably, when pruning 25% of the layers of Qwen3-32B, E$^3$-Pruner achieves 96% accuracy on MATH-500, only 0.8 points below the original model (96.8%) and above the existing SOTA (95%), with a 1.33$\times$ inference speedup while consuming only 0.5B tokens (0.5% of the post-training data volume).
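The Gumbel-TopK sampling step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `gumbel_topk_mask`, the temperature `tau`, the softmax relaxation, and the example logits are all assumptions made for illustration. The sketch perturbs one importance logit per layer with Gumbel noise and keeps the `k` layers with the largest perturbed scores, so every sample is a valid mask with exactly `k` kept layers.

```python
import math
import random


def gumbel_topk_mask(logits, k, tau=1.0, rng=None):
    """Sample a binary keep-mask with exactly k ones via the Gumbel-TopK trick.

    logits: one learnable importance score per layer (hypothetical values here).
    Returns (hard_mask, soft_scores), where soft_scores is a softmax over the
    perturbed logits at temperature tau (an assumed relaxation, not from the paper).
    """
    if not 0 <= k <= len(logits):
        raise ValueError("k must be between 0 and the number of layers")
    rng = rng or random.Random()

    # Add independent Gumbel(0, 1) noise to each logit.
    perturbed = []
    for logit in logits:
        u = max(rng.random(), 1e-12)  # guard against log(0)
        perturbed.append(logit - math.log(-math.log(u)))

    # Hard mask: keep the k layers with the largest perturbed scores.
    order = sorted(range(len(logits)), key=lambda i: perturbed[i], reverse=True)
    hard_mask = [0] * len(logits)
    for i in order[:k]:
        hard_mask[i] = 1

    # Soft scores: numerically stable softmax of the perturbed logits.
    scaled = [s / tau for s in perturbed]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    soft_scores = [e / total for e in exps]
    return hard_mask, soft_scores


# Example: keep 6 of 8 layers, i.e. prune 25% of them.
example_logits = [2.0, 0.1, -1.0, 1.5, 0.3, -0.5, 1.0, 0.0]
mask, soft = gumbel_topk_mask(example_logits, k=6, rng=random.Random(0))
```

In an autograd framework, a common way to make such a sampler trainable is a straight-through estimator that uses the hard mask in the forward pass and the soft scores in the backward pass. Whether E$^3$-Pruner does exactly this is not stated in the abstract.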

Paper Structure

This paper contains 41 sections, 6 equations, 8 figures, 7 tables, 2 algorithms.

Figures (8)

  • Figure 1: Comparison of E3-Pruner with current state-of-the-art structural pruning models on LLaMA-2-7B at a 60% pruning ratio.
  • Figure 2: Comparison of task effectiveness, training economy, and inference efficiency among existing layer pruning methods. All experiments use LLaMA-2-7B at 60% sparsity.
  • Figure 3: The E3-Pruner framework. KL divergence is used to initialize the proposed Gumbel-TopK sampler, which searches for the optimal mask in a differentiable way. The pruned model then undergoes efficient adaptive knowledge distillation to restore its performance.
  • Figure 4: Ablations on Qwen2.5-14B-Instruct. (a) Our initialization identifies important layers from the start. (b) The proposed components lower the loss in SFT training. (c) The proposed components lower the loss in KD training.
  • Figure 5: Performance improvement breakdown. Applying the proposed E3-Pruner yields significant performance improvements across diverse benchmarks on Qwen2.5-14B-Instruct.
  • ...and 3 more figures
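
The distillation stage described in the Figure 3 caption can be pictured with a small sketch of an entropy-weighted, token-level KD loss. The weighting rule below (per-token weight proportional to the teacher's predictive entropy) is an assumption chosen for illustration; the page does not give the paper's actual formulation, and the function names are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0.0)


def entropy_weighted_kd_loss(teacher_probs, student_probs):
    """Token-level KD loss with weights proportional to teacher entropy.

    teacher_probs, student_probs: lists with one distribution per token.
    ASSUMPTION: uncertain teacher tokens get more weight; the real method may differ.
    """
    entropies = [entropy(p) for p in teacher_probs]
    total = sum(entropies)
    n = len(entropies)
    if total <= 0.0:
        # Every teacher distribution is one-hot: fall back to uniform weights.
        weights = [1.0 / n] * n
    else:
        weights = [h / total for h in entropies]
    per_token_kl = [kl_divergence(p, q) for p, q in zip(teacher_probs, student_probs)]
    return sum(w * d for w, d in zip(weights, per_token_kl))


# Example: two tokens over a three-word vocabulary (made-up numbers).
teacher = [[0.7, 0.2, 0.1], [0.4, 0.3, 0.3]]
student = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
loss = entropy_weighted_kd_loss(teacher, student)
```

Under this assumed rule, a token where the teacher is fully confident (a one-hot distribution) contributes nothing to the loss, while tokens where the teacher is uncertain dominate it. The opposite weighting is equally easy to express by changing how `weights` is computed.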