Selecting Large Language Model to Fine-tune via Rectified Scaling Law
Haowei Lin, Baizhou Huang, Haotian Ye, Qinyu Chen, Zihao Wang, Sujian Li, Jianzhu Ma, Xiaojun Wan, James Zou, Yitao Liang
TL;DR
The paper addresses how to select a pre-trained LLM for downstream fine-tuning under strict resource constraints. It identifies a phase transition in fine-tuning scaling and introduces two tools: the Rectified Scaling Law, which adds a pre-learned data size $D_l$, and the Accept then Stop (AtS) algorithm, which extrapolates full-data performance from limited subsets. Key contributions include a scaling law that fits both the pre-power and power phases, a data-budgeted LLM selection method, and experiments across 30 models and three downstream tasks showing substantial computational savings with near-optimal model choices. The result is a principled way to select LLMs that reduces energy use and cost while maintaining performance.
Abstract
The ever-growing ecosystem of LLMs has made it challenging to select the most appropriate pre-trained model to fine-tune amidst a sea of options. Given constrained resources, fine-tuning all models and making selections afterward is unrealistic. In this work, we formulate this resource-constrained selection task as predicting fine-tuning performance and illustrate its natural connection with the Scaling Law. Unlike pre-training, we find that the fine-tuning scaling curve includes not just the well-known "power phase" but also the previously unobserved "pre-power phase". We also explain, both theoretically and empirically, why the existing Scaling Law fails to capture this phase transition. To address this, we introduce the concept of "pre-learned data size" into our Rectified Scaling Law, which overcomes these theoretical limitations and fits experimental results much better. By leveraging our law, we propose a novel LLM selection algorithm that selects a near-optimal model with hundreds of times less resource consumption, while other methods may produce selections that are negatively correlated with actual performance. The project page is available at rectified-scaling-law.github.io.
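To make the phase transition concrete, here is a minimal Python sketch of how a pre-learned data size term can produce both phases. The functional form L(D) = B / (D_l + D^beta) + E is an assumption for illustration, since this summary does not state the law explicitly, and all constants (B, D_l, beta, E) are hypothetical values, not fitted results from the paper. The sketch measures the log-log slope of the reducible loss: it should be close to zero when D^beta is much smaller than D_l (the flat "pre-power phase") and approach -beta when D^beta dominates (the "power phase").

```python
import math

# Hypothetical constants chosen for illustration only (not from the paper).
B, D_L, BETA, E = 100.0, 1000.0, 0.5, 1.0


def reducible_loss(d, b=B, d_l=D_L, beta=BETA):
    """Reducible part of the assumed law: B / (D_l + D**beta)."""
    return b / (d_l + d ** beta)


def rectified_loss(d, b=B, d_l=D_L, beta=BETA, e=E):
    """Assumed rectified form: L(D) = B / (D_l + D**beta) + E."""
    return reducible_loss(d, b, d_l, beta) + e


def loglog_slope(d, h=1e-3):
    """Central-difference slope of log(reducible loss) against log(D)."""
    upper = math.log(reducible_loss(d * math.exp(h)))
    lower = math.log(reducible_loss(d * math.exp(-h)))
    return (upper - lower) / (2.0 * h)


if __name__ == "__main__":
    # Sweep fine-tuning data sizes over many orders of magnitude.
    for exponent in range(1, 13, 2):
        d = 10.0 ** exponent
        print(f"D=1e{exponent:<2d} loss={rectified_loss(d):.4f} "
              f"slope={loglog_slope(d):+.4f}")
```

Under this assumed form, a plain power law (D_l = 0) would have a constant slope of -beta everywhere, which is why it cannot describe the flat region at small data sizes.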
