Pangu Light: Weight Re-Initialization for Pruning and Accelerating LLMs

Hanting Chen, Jiarui Qin, Jialong Guo, Tao Yuan, Yichun Yin, Huiling Zhen, Yasheng Wang, Jinpeng Li, Xiaojun Meng, Meng Zhang, Rongju Ruan, Zheyuan Bai, Yehui Tang, Can Chen, Xinghao Chen, Fisher Yu, Ruiming Tang, Yunhe Wang

TL;DR

This work tackles the practical challenge of deploying large language models by addressing the instability caused by aggressive joint pruning of width and depth. It introduces Pangu Light, a framework that couples multi-axis structured pruning with dedicated weight re-initialization methods (CLAP for depth, SLNP for width) and normalization optimizations (Post-RMSNorm absorption), plus a DSSN-aware absorption strategy to accelerate inference on Ascend NPUs. The approach is reinforced by knowledge distillation during a recovery phase to reclaim performance, yielding superior accuracy-efficiency trade-offs compared with baselines like Qwen and PUZZLE across reasoning benchmarks. The results demonstrate that principled weight re-initialization, hardware-aware design, and targeted normalization optimization enable substantial acceleration with minimal loss in reasoning capabilities, making practical deployment of large-scale LLMs more feasible.

Abstract

Large Language Models (LLMs) deliver state-of-the-art capabilities across numerous tasks, but their immense size and inference costs pose significant computational challenges for practical deployment. While structured pruning offers a promising avenue for model compression, existing methods often struggle with aggressive, simultaneous width and depth reductions, which lead to substantial performance degradation. This paper argues that a critical, often overlooked aspect of making such aggressive joint pruning viable is the strategic re-initialization and adjustment of the remaining weights, which improves the model's accuracy after post-pruning training. We introduce Pangu Light, a framework for LLM acceleration centered around structured pruning coupled with novel weight re-initialization techniques designed to address this "missing piece". Our framework systematically targets multiple axes, including model width, depth, attention heads, and RMSNorm. Its effectiveness is rooted in novel re-initialization methods, Cross-Layer Attention Pruning (CLAP) and Stabilized LayerNorm Pruning (SLNP), which mitigate performance drops by giving the network a better starting point for training. To further enhance efficiency, Pangu Light incorporates specialized optimizations such as absorbing Post-RMSNorm computations, and it tailors its strategies to Ascend NPU characteristics. The Pangu Light models consistently exhibit a superior accuracy-efficiency trade-off, outperforming prominent baseline pruning methods like Nemotron and established LLMs like the Qwen3 series. For instance, on Ascend NPUs, Pangu Light-32B's 81.6 average score and 2585 tokens/s throughput exceed Qwen3-32B's 80.9 average score and 2225 tokens/s.
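The abstract mentions absorbing Post-RMSNorm computations but does not describe the mechanism. As a rough illustration only, the sketch below shows one generic way a normalization that follows a linear layer could be folded into that layer's weights, assuming the input-dependent RMS statistic is replaced by a fixed constant. That replacement, the constant `c`, and the helper names `post_rmsnorm` and `absorb_scale` are all assumptions made for this example and are not taken from the paper.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rms(v, eps=1e-6):
    """Root mean square of a vector, with a small epsilon for stability."""
    return math.sqrt(sum(a * a for a in v) / len(v) + eps)

def post_rmsnorm(h, gamma, eps=1e-6):
    """RMSNorm applied to a sublayer output h, with affine scale gamma."""
    r = rms(h, eps)
    return [g * a / r for g, a in zip(gamma, h)]

def absorb_scale(W, gamma, c):
    """Fold the constant per-channel scale gamma / c into the rows of W.

    Hypothetical helper: it assumes the dynamic RMS statistic has been
    replaced by a fixed constant c, so the norm becomes a pure row scaling.
    """
    return [[g / c * w for w in row] for g, row in zip(gamma, W)]

# Toy linear layer (2 outputs, 3 inputs) followed by a post-norm.
W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
gamma = [1.2, 0.8]
x = [1.0, 2.0, -1.0]

h = matvec(W, x)
# Assumption: c would normally be a constant estimated offline. Here it is
# set to the exact RMS of this one input so the two paths can be compared.
c = rms(h)

reference = post_rmsnorm(h, gamma)             # linear layer, then norm
folded = matvec(absorb_scale(W, gamma, c), x)  # single fused linear layer
max_abs_diff = max(abs(a - b) for a, b in zip(reference, folded))
```

When `c` equals the true RMS of the input, the fused layer reproduces the normalized output up to floating-point rounding. With a fixed `c` the result is only an approximation for other inputs, which is one reason a recovery phase after such a change is plausible.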

Paper Structure

This paper contains 32 sections, 9 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Conceptual overview of the Pangu Light methodology, illustrating its integrated approach that combines importance-based structural pruning with novel weight re-initialization strategies (a)(b) and specialized normalization layer optimization (c).
  • Figure 2: Performance ratio with respect to pruning ratio and acceleration ratio, illustrating the accuracy-efficiency trade-off. The Pangu Light series exhibits a more favorable curve than both the Qwen3 series and the PUZZLE framework.
  • Figure 3: Distribution of the Sandwich-Norm's affine scale parameters $\boldsymbol{\gamma}$ before and after pruning. Statistics, specifically mean and standard deviation, are shown for retained components within each relevant layer. The consistent distributions post-pruning, particularly after applying activation-based channel pruning with Stabilized LayerNorm Pruning (SLNP), indicate the stability of our method in preserving learned parameter characteristics.