Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training

Jakub Krajewski, Amitis Shidani, Dan Busbridge, Sam Wiseman, Jason Ramapuram

TL;DR

This work treats downstream benchmark performance as a quantity that can be predicted directly from scale, modeling log accuracy as a power-law function of training FLOPs at a fixed token-to-parameter ratio. It shows that simple direct fits, including a Power Law and a Broken Neural Scaling Law (BNSL) variant, predict downstream task performance and extrapolate better than traditional two-stage approaches. The authors extend the framework across token-to-parameter ratios and to repeated sampling (pass@k), and validate it on models of up to 17B parameters trained on up to 350B tokens, across 12 benchmarks. They also show that scaling behavior depends on the data mixture, and they release the pretraining losses and downstream evaluation results to support reproducibility. Overall, the paper offers a practical methodology for forecasting downstream capabilities from scale, which can aid planning in large language model training.

Abstract

While scaling laws for Large Language Models (LLMs) traditionally focus on proxy metrics like pretraining loss, predicting downstream task performance has been considered unreliable. This paper challenges that view by proposing a direct framework to model the scaling of benchmark performance from the training budget. We find that for a fixed token-to-parameter ratio, a simple power law can accurately describe the scaling behavior of log accuracy on multiple popular downstream tasks. Our results show that the direct approach extrapolates better than the previously proposed two-stage procedure, which is prone to compounding errors. Furthermore, we introduce functional forms that predict accuracy across token-to-parameter ratios and account for inference compute under repeated sampling. We validate our findings on models with up to 17B parameters trained on up to 350B tokens across two dataset mixtures. To support reproducibility and encourage future research, we release the complete set of pretraining losses and downstream evaluation results.
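
To make the central claim concrete, the following is a minimal sketch of a direct fit of log accuracy against training FLOPs. The exact functional form used in the paper is not reproduced in this summary, so the form below, $-\log(\text{acc}) = A \cdot C^{-\alpha}$, is an assumption, as are the names `fit_power_law` and `predict_accuracy` and the synthetic data. The sketch ignores any random-guess floor or irreducible error term.

```python
import math

def fit_power_law(flops, accuracies):
    """Fit -log(acc) = A * C**(-alpha) by least squares in log-log space.

    Assumed form (not taken from the paper): taking logs on both sides gives
    log(-log(acc)) = log(A) - alpha * log(C), which is linear in log(C).
    Accuracies must lie strictly between 0 and 1.
    """
    xs = [math.log(c) for c in flops]
    ys = [math.log(-math.log(a)) for a in accuracies]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope  # (A, alpha)

def predict_accuracy(flops, A, alpha):
    """Predicted accuracy at a given training budget under the assumed form."""
    return math.exp(-A * flops ** (-alpha))

# Synthetic, noise-free data generated from the assumed form (illustrative only).
true_A, true_alpha = 50.0, 0.1
budgets = [1e18, 1e19, 1e20, 1e21]
accs = [predict_accuracy(c, true_A, true_alpha) for c in budgets]

A, alpha = fit_power_law(budgets, accs)
extrapolated = predict_accuracy(1e22, A, alpha)  # forecast beyond the fitted range
```

With real measurements the points would be noisy, and a nonlinear least-squares fit on the accuracies themselves may be preferable to this log-space regression; the sketch only illustrates the shape of a direct, single-stage fit.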

Paper Structure

This paper contains 42 sections, 15 equations, 17 figures, 16 tables.

Figures (17)

  • Figure 1: Benchmark accuracy can be described using a direct scaling law based on training FLOPs. The solid line represents the scaling law fit using Equation \ref{eq:scaling_law_basic}, and each point corresponds to accuracy measured for the final checkpoint at a given training budget.
  • Figure 2: Comparison of scaling the downstream accuracy on different token-to-parameter ratios. Fit quality for all benchmarks can be found in Appendix \ref{app_multi_tpr}.
  • Figure 3: Comparison of pass@k behaviour across tasks. (a) Intuition for the functional form. (b) Predicted pass rate curves for HumanEval. (c) Predicted pass rate curves for LBPP.
  • Figure 4: Dependency between downstream task and proxy metric candidates. All metrics demonstrate strong predictive power, i.e., high $R^2$ and low RMSE.
  • Figure 5: Scaling law fits. Comparing the direct approaches, Power Law (Equation \ref{eq:scaling_law_basic}) and BNSL (Equation \ref{bnsl}) from Section \ref{sub:scaling_full}, with two-stage approaches (Linear and Logistic) for ARC Challenge. Plots for all benchmarks are shown in Appendix \ref{app:all_laws_benchmarks}.
  • ...and 12 more figures
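
For readers unfamiliar with the pass@k metric referenced in Figure 3, the sketch below computes the standard unbiased pass@k estimator commonly used with HumanEval-style benchmarks: given `n` sampled completions of which `c` are correct, it estimates the probability that at least one of `k` samples passes. Whether the paper uses exactly this estimator is an assumption, and the paper's own scaling form for pass@k is not reproduced here. The function name `pass_at_k` is illustrative.

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: total samples drawn for a problem
    c: number of those samples that pass
    k: sampling budget being evaluated (k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 10 samples, 3 correct.
p1 = pass_at_k(10, 3, 1)    # equals the raw pass rate c / n
p5 = pass_at_k(10, 3, 5)
p10 = pass_at_k(10, 3, 10)  # every sample is used, so at least one passes
```

Averaging this quantity over all problems in a benchmark gives the benchmark-level pass@k that a scaling law under repeated sampling would aim to predict.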