P$^2$ Law: Scaling Law for Post-Training After Model Pruning
Xiaodong Chen, Yuxuan Hu, Xiaokang Zhang, Yanling Wang, Cuiping Li, Hong Chen, Jing Zhang
TL;DR
The paper addresses how much data to budget for post-training after model pruning by introducing the P$^2$ Law, a scaling law that predicts post-training loss from four factors: the pre-pruning model size $N_0$, the number of post-training tokens $D$, the pruning rate $\rho$, and the pre-pruning loss $\mathcal{L}_0$. Building on the Chinchilla scaling law, the authors derive candidate parameterizations and introduce the Average Slope Difference (ASD) to choose among them, ultimately selecting $\mathcal{L}_1(N_0,D,\rho,\mathcal{L}_0) = \mathcal{L}_0 + (\frac{1}{\rho})^{\gamma}(\frac{1}{N_0})^{\delta}(\frac{N_C}{N_0^{\alpha}} + \frac{D_C}{D^{\beta}} + E)$. In experiments on Llama-3 and Qwen-2.5 models with depth, width, and 2:4 semi-structured pruning, the P$^2$ Law accurately fits post-training losses and generalizes to larger datasets, larger models, and higher pruning rates, which makes it a principled tool for choosing a post-training data budget. The results also show that the $\mathcal{L}_1$ parameterization outperforms the alternatives in fitting quality and condition satisfaction. Finally, the work notes limitations due to computational constraints and points to future work extending the law to even larger models and broader pruning scenarios.
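To make the selected parameterization concrete, here is a minimal Python sketch that evaluates $\mathcal{L}_1$ exactly as written above and prints how the predicted loss changes as the token budget $D$ doubles. The function name `p2_law_loss`, the example inputs, and every constant in `HYPOTHETICAL_CONSTANTS` are assumptions chosen for illustration only; they are not the fitted values reported by the authors.

```python
def p2_law_loss(n0, d, rho, loss0, *, n_c, d_c, e, alpha, beta, gamma, delta):
    """Evaluate L1(N0, D, rho, L0) as written in the TL;DR.

    n0    : model size before pruning (parameters)
    d     : number of post-training tokens
    rho   : pruning rate, in (0, 1]
    loss0 : loss of the model before pruning
    """
    scale = (1.0 / rho) ** gamma * (1.0 / n0) ** delta
    return loss0 + scale * (n_c / n0 ** alpha + d_c / d ** beta + e)


# Hypothetical constants for illustration; NOT the paper's fitted values.
HYPOTHETICAL_CONSTANTS = dict(
    n_c=1.0, d_c=50.0, e=0.1, alpha=0.1, beta=0.3, gamma=-1.0, delta=0.05
)

# Assumed scenario: an 8e9-parameter model, pruning rate 0.25, pre-pruning loss 2.0.
token_budgets = [1e8, 2e8, 4e8, 8e8]
losses = [
    p2_law_loss(8e9, d, 0.25, 2.0, **HYPOTHETICAL_CONSTANTS) for d in token_budgets
]

# Loss reduction obtained by each successive doubling of the token budget.
gains = [before - after for before, after in zip(losses, losses[1:])]

for d, loss in zip(token_budgets, losses):
    print(f"D = {d:.0e} tokens -> predicted loss {loss:.4f}")
print("gain per doubling:", [f"{g:.5f}" for g in gains])
```

With positive $D_C$ and $\beta$, each doubling of $D$ lowers the predicted loss by less than the previous one. Given fitted constants, this shrinking gain is what would let a practitioner stop adding post-training data once it falls below a chosen threshold.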
Abstract
Pruning has become a widely adopted technique for reducing the hardware requirements of large language models (LLMs). Post-training is commonly employed after pruning to recover the resulting loss in model performance. While post-training benefits from larger datasets, once the dataset is already substantial, adding more training data yields only limited performance gains. To balance post-training cost against model performance, it is necessary to determine the optimal amount of post-training data. Through extensive experiments on the Llama-3 and Qwen-2.5 series models, pruned using various common pruning methods, we uncover the scaling \textbf{Law} for \textbf{P}ost-training after model \textbf{P}runing, referred to as the P$^2$ Law. This law identifies four key factors for predicting the pruned model's post-training loss: the model size before pruning, the number of post-training tokens, the pruning rate, and the model's loss before pruning. Moreover, the P$^2$ Law generalizes to larger dataset sizes, larger model sizes, and higher pruning rates, offering valuable insights for the post-training of pruned LLMs.
