P$^2$ Law: Scaling Law for Post-Training After Model Pruning

Xiaodong Chen; Yuxuan Hu; Xiaokang Zhang; Yanling Wang; Cuiping Li; Hong Chen; Jing Zhang

TL;DR

The paper addresses how much data to budget for post-training after model pruning. It introduces the P$^2$ Law, a scaling law that predicts the post-training loss of a pruned model from four factors: the model size before pruning $N_0$, the number of post-training tokens $D$, the pruning rate $\rho$, and the model's loss before pruning $\mathcal{L}_0$. Building on the Chinchilla scaling law, the authors derive several candidate parameterizations and introduce a metric, Average Slope Difference (ASD), to choose among them, ultimately selecting $\mathcal{L}_1(N_0, D, \rho, \mathcal{L}_0) = \mathcal{L}_0 + \left(\frac{1}{\rho}\right)^{\gamma}\left(\frac{1}{N_0}\right)^{\delta}\left(\frac{N_C}{N_0^{\alpha}} + \frac{D_C}{D^{\beta}} + E\right)$. In experiments on Llama-3 and Qwen-2.5 models pruned with depth pruning, width pruning, and 2:4 semi-structured pruning, the P$^2$ Law fits the observed post-training losses accurately and generalizes to larger datasets, larger models, and higher pruning rates. The $\mathcal{L}_1$ parameterization also outperforms the alternative candidates in both fitting quality and satisfaction of the required conditions, which makes it a practical guide for allocating post-training resources to pruned LLMs. Finally, the authors note limitations imposed by computational constraints and point to future work extending the law to even larger models and broader pruning scenarios.
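To make the selected parameterization concrete, the following minimal sketch evaluates $\mathcal{L}_1$ for a range of post-training token counts. The function name `p2_law_loss`, the units of `n0`, and every coefficient value below are illustrative assumptions, not fitted values from the paper; only the functional form is taken from the formula above.

```python
def p2_law_loss(n0, d, rho, l0, *, n_c, d_c, e, alpha, beta, gamma, delta):
    """Evaluate the L_1 form of the P^2 Law.

    n0  : model size before pruning
    d   : number of post-training tokens
    rho : pruning rate, in (0, 1]
    l0  : loss of the model before pruning
    The keyword arguments are the fitted constants N_C, D_C, E and the
    exponents alpha, beta, gamma, delta from the formula.
    """
    # Prefactor that depends only on the pruning rate and the original size.
    scale = (1.0 / rho) ** gamma * (1.0 / n0) ** delta
    # Chinchilla-style term: model-size part + data part + constant.
    return l0 + scale * (n_c / n0 ** alpha + d_c / d ** beta + e)


if __name__ == "__main__":
    # Hypothetical coefficients, chosen only so the example runs;
    # they are NOT the values reported in the paper.
    coeffs = dict(n_c=1.0, d_c=50.0, e=0.1,
                  alpha=0.3, beta=0.3, gamma=-1.0, delta=0.1)
    # n0 is expressed in billions of parameters here (an assumption).
    for tokens in (1e8, 1e9, 1e10):
        loss = p2_law_loss(n0=8.0, d=tokens, rho=0.25, l0=2.0, **coeffs)
        print(f"D = {tokens:.0e} tokens -> predicted loss {loss:.4f}")
```

With a fitted set of coefficients, the same function could be evaluated over a grid of $D$ values to find the point beyond which additional post-training tokens reduce the predicted loss only marginally.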

Abstract

Pruning has become a widely adopted technique for reducing the hardware requirements of large language models (LLMs). Post-training is commonly employed after pruning to recover the resulting loss in model performance. While post-training benefits from larger datasets, once the dataset is already substantial, adding more training data yields only limited performance gains. To balance post-training cost and model performance, it is necessary to determine the optimal amount of post-training data. Through extensive experiments on the Llama-3 and Qwen-2.5 series models, pruned using various common pruning methods, we uncover the scaling Law for Post-training after model Pruning, referred to as the P$^2$ Law. This law identifies four key factors for predicting the pruned model's post-training loss: the model size before pruning, the number of post-training tokens, the pruning rate, and the model's loss before pruning. Moreover, the P$^2$ Law generalizes to larger dataset sizes, larger model sizes, and higher pruning rates, offering valuable insights for the post-training of pruned LLMs.

