Selecting Large Language Model to Fine-tune via Rectified Scaling Law

Haowei Lin, Baizhou Huang, Haotian Ye, Qinyu Chen, Zihao Wang, Sujian Li, Jianzhu Ma, Xiaojun Wan, James Zou, Yitao Liang

TL;DR

The paper tackles selecting an appropriate pre-trained LLM for downstream fine-tuning under strict resource constraints. It reveals a phase transition in fine-tuning scaling, introducing the Rectified Scaling Law with a pre-learned data size $D_l$ and the Accept then Stop (AtS) algorithm to extrapolate full-data performance from limited subsets. Key contributions include a robust scaling law that fits both pre-power and power phases, an efficient data-budgeted LLM selection method, and extensive experiments across 30 models and three downstream tasks demonstrating substantial computational savings with near-optimal model choices. This work enables practical, scalable, and principled LLM selection for real-world applications, reducing energy use and cost while maintaining performance.

Abstract

The ever-growing ecosystem of LLMs has posed a challenge in selecting the most appropriate pre-trained model to fine-tune amidst a sea of options. Given constrained resources, fine-tuning all models and making selections afterward is unrealistic. In this work, we formulate this resource-constrained selection task into predicting fine-tuning performance and illustrate its natural connection with Scaling Law. Unlike pre-training, we find that the fine-tuning scaling curve includes not just the well-known "power phase" but also the previously unobserved "pre-power phase". We also explain why existing Scaling Law fails to capture this phase transition phenomenon both theoretically and empirically. To address this, we introduce the concept of "pre-learned data size" into our Rectified Scaling Law, which overcomes theoretical limitations and fits experimental results much better. By leveraging our law, we propose a novel LLM selection algorithm that selects the near-optimal model with hundreds of times less resource consumption, while other methods may provide negatively correlated selection. The project page is available at rectified-scaling-law.github.io.
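
The selection idea described above (fit a scaling curve on small subsets, extrapolate to the full dataset, pick the model with the lowest predicted loss) can be sketched as follows. This is a minimal illustration, not the paper's AtS algorithm. It assumes a rectified form $\hat{\mathcal{L}}(D) = B / (D_l + D^{\beta}) + E$, which this summary does not state, and it uses a plain grid search with least squares rather than whatever fitting procedure the authors use. All function names, model names, and the synthetic curves are hypothetical.

```python
import math


def rectified_law(d, B, E, beta, d_l):
    """Assumed rectified form (not given in this summary): B / (D_l + D^beta) + E."""
    return B / (d_l + d ** beta) + E


def fit_rectified_law(sizes, losses):
    """Grid-search beta and D_l; for each pair, B and E follow from least squares."""
    betas = [0.05 * k for k in range(1, 41)]              # 0.05 ... 2.0
    d_ls = [0.0] + [10 ** (0.25 * k) for k in range(25)]  # 0, then 1 ... 1e6
    n = len(sizes)
    l_mean = sum(losses) / n
    best_sse, best_params = math.inf, None
    for beta in betas:
        for d_l in d_ls:
            # With beta and D_l fixed, the law is linear in B and E: L = B * g + E.
            g = [1.0 / (d_l + d ** beta) for d in sizes]
            g_mean = sum(g) / n
            var = sum((gi - g_mean) ** 2 for gi in g)
            if var == 0.0:
                continue
            B = sum((gi - g_mean) * (li - l_mean) for gi, li in zip(g, losses)) / var
            if B <= 0.0:
                continue  # keep only curves whose loss decreases with D
            E = l_mean - B * g_mean
            sse = sum((B * gi + E - li) ** 2 for gi, li in zip(g, losses))
            if sse < best_sse:
                best_sse = sse
                best_params = {"B": B, "E": E, "beta": beta, "d_l": d_l}
    if best_params is None:
        raise ValueError("no decreasing fit found on the grid")
    return best_params


def select_model(subset_curves, full_size):
    """Return the model with the lowest extrapolated loss at full_size."""
    predicted = {}
    for name, (sizes, losses) in subset_curves.items():
        params = fit_rectified_law(sizes, losses)
        predicted[name] = rectified_law(full_size, **params)
    return min(predicted, key=predicted.get), predicted


# Synthetic, noiseless curves for two hypothetical models (not paper results).
subset_sizes = [200, 400, 800, 1600, 3200]
true_params = {
    "model_a": dict(B=20.0, E=2.0, beta=0.3, d_l=10.0),
    "model_b": dict(B=400.0, E=1.0, beta=0.5, d_l=100.0),
}
curves = {
    name: (subset_sizes, [rectified_law(d, **p) for d in subset_sizes])
    for name, p in true_params.items()
}
best, predicted = select_model(curves, full_size=1_000_000)
print(best, predicted)
```

In this synthetic setup `model_a` has the lower loss at every subset size, yet the extrapolated curves cross before the full dataset size. That is the situation in which ranking models by their small-subset loss alone would pick the wrong one.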

Paper Structure

This paper contains 52 sections, 2 theorems, 14 equations, 12 figures, 12 tables, 1 algorithm.

Key Result

Theorem 3.1

For any positive parameters $B, E, \alpha, \beta$, let $f(x) = \log \hat{\mathcal{L}}(e^{x})$ denote the log-log form of the function $\hat{\mathcal{L}}(\cdot)$ in Eq. \ref{eq:prev_law_fix}. Then the derivative $f'$ is negative and non-decreasing.
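
A short worked derivation shows where the sign and the monotonicity come from. This summary does not reproduce the vanilla law itself, so the form used here, $\hat{\mathcal{L}}(D) = (B / D^{\beta} + E)^{\alpha}$, is an assumption chosen only because it involves the four parameters named in the theorem. Writing $f(x) = \log \hat{\mathcal{L}}(e^{x})$ for the log-log form:

$$
\begin{aligned}
f(x) &= \alpha \log\left(B e^{-\beta x} + E\right), \\
f'(x) &= -\frac{\alpha \beta B e^{-\beta x}}{B e^{-\beta x} + E} = -\frac{\alpha \beta B}{B + E e^{\beta x}} < 0, \\
f''(x) &= \frac{\alpha \beta^{2} B E e^{\beta x}}{\left(B + E e^{\beta x}\right)^{2}} > 0 .
\end{aligned}
$$

Under this assumed form the log-log slope is steepest at small $D$ and can only flatten as $D$ grows. Such a curve cannot start nearly flat and then steepen, which is the shape a pre-power phase followed by a power phase requires.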

Figures (12)

  • Figure 1: (a) The Pearson correlation between the true full-fine-tuning performance and the performance predicted by three intuitive methods, under different resource constraints denoted by $\gamma$. These baseline methods predict performance poorly, especially under demanding constraints (small $\gamma$), and can even give negatively correlated predictions. (b) The phase transition observed in the scaling of fine-tuning loss $L$ with training sample size $D$. In addition to the widely studied power phase, where $L$ and $D$ are linearly related on a log-log scale, we discover a pre-power phase when $D$ is small. Previous laws fail to fit both phases, while our proposed law fits both well. (c) Our LLM selection algorithm, which extrapolates full-fine-tuning performance based on the new law.
  • Figure 2: The difference in scaling behavior between pre-training and fine-tuning. In pre-training, performance scales with model size independently of model shape; in fine-tuning, it does not. The figure is based on Figure 1 of Tay et al. (2021).
  • Figure 3: The phase transition from the pre-power phase to the power phase, and the fit quality of different scaling laws. The x and y axes show fine-tuning dataset size $D$ and test loss $L$ on a log scale. Each subfigure corresponds to one dataset. Solid lines are fits of our law (Eq. \ref{eq.law}), and dashed lines are fits of the vanilla law (Eq. \ref{eq:prev_law_fix}). The full results for all models are in Appendix \ref{app:fine-tune-results}.
  • Figure 4: Root-mean-square deviation (RMSD) of our law (Eq. \ref{eq.law}) and the vanilla law (Eq. \ref{eq:prev_law_fix}) when fitting fine-tuning test loss versus dataset size on a log scale. Under the same settings, our law achieves a much lower RMSD.
  • Figure 5: Failure cases for the three baseline methods. The horizontal dashed lines denote zero-shot performance, and each point denotes the test loss after fine-tuning the corresponding model on $\mathcal{S}_{sub}$ with size $D$. LaMini-GPT-124M has the best full-fine-tuning performance, but it performs poorly at small $D$.
  • ...and 7 more figures
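
The two metrics named in the captions of Figures 1 and 4 can be sketched as below. This assumes that RMSD is taken over log-losses (the caption only says "in log scale") and that the Pearson correlation is computed across models between true and predicted full-fine-tuning performance. The function names are hypothetical, and the numbers in the usage lines are made up for illustration.

```python
import math


def rmsd_log(true_losses, fitted_losses):
    """Root-mean-square deviation between log(true) and log(fitted) losses."""
    sq = [(math.log(t) - math.log(f)) ** 2 for t, f in zip(true_losses, fitted_losses)]
    return math.sqrt(sum(sq) / len(sq))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up losses for four hypothetical models: true vs. predicted at full data size.
true_full = [2.10, 1.85, 2.40, 1.95]
predicted_full = [2.05, 1.90, 2.30, 2.00]
print(pearson(true_full, predicted_full))
print(rmsd_log(true_full, predicted_full))
```

A correlation near 1 means the predictor ranks models roughly as full fine-tuning would, while a negative value means it tends to prefer the worse models.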

Theorems & Definitions (5)

  • Definition 2.1: LLM Selection for Fine-tuning
  • Definition 2.2: Power-law in Kaplan et al. (2020)
  • Theorem 3.1
  • Definition 3.2: Rectified Scaling Law
  • Theorem 3.3