
Test-Time Scaling Makes Overtraining Compute-Optimal

Nicholas Roberts, Sungjun Cho, Zhiqi Gao, Tzu-Heng Huang, Albert Wu, Gabriel Orlanski, Avi Trost, Kelly Buchanan, Aws Albarghouthi, Frederic Sala

Abstract

Modern LLMs scale at test-time, e.g., via repeated sampling, where inference cost grows with model size and the number of samples. This creates a trade-off that pretraining scaling laws, such as Chinchilla, do not address. We present Train-to-Test ($T^2$) scaling laws that jointly optimize model size, training tokens, and number of inference samples under fixed end-to-end budgets. $T^2$ modernizes pretraining scaling laws with the pass@$k$ modeling used for test-time scaling, then jointly optimizes pretraining and test-time decisions. Forecasts from $T^2$ are robust across two distinct modeling approaches: measuring the joint scaling effect on task loss, and modeling the impact on task accuracy. Across eight downstream tasks, we find that when accounting for inference cost, optimal pretraining decisions shift radically into the overtraining regime, well outside the range of standard pretraining scaling suites. We validate our results by pretraining heavily overtrained models in the optimal region that $T^2$ scaling forecasts, confirming their substantially stronger performance compared to pretraining scaling alone. Finally, as frontier LLMs are post-trained, we show that our findings survive the post-training stage, making $T^2$ scaling meaningful in modern deployments.
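The joint optimization the abstract describes can be sketched as a small search: pick model size $N$, training tokens $D$, and samples $k$ to maximize pass@$k$ subject to a fixed total FLOP budget covering both training (~$6ND$) and repeated-sampling inference (~$2N$ per generated token). The sketch below is purely illustrative: the constants, the logistic loss-to-accuracy mapping, and the grid are assumptions for demonstration, not the paper's fitted values or method.

```python
import math

# Hypothetical Chinchilla-style loss fit; A, B, E, alpha, beta are
# illustrative placeholders, not the paper's fitted coefficients.
A, B, E = 406.4, 410.7, 1.69
alpha, beta = 0.34, 0.28

def loss(N, D):
    """Parametric pretraining loss for N parameters and D training tokens."""
    return E + A / N**alpha + B / D**beta

def pass_at_1(N, D, a=8.0, b=2.0):
    """Assumed logistic map from loss to per-sample success probability."""
    return 1.0 / (1.0 + math.exp(a * (loss(N, D) - b)))

def pass_at_k(N, D, k):
    """Chance that at least one of k independent samples succeeds."""
    return 1.0 - (1.0 - pass_at_1(N, D)) ** k

def total_flops(N, D, k, queries=1_000, tokens_per_query=512):
    """End-to-end budget: ~6ND for training plus ~2N per inference token."""
    return 6 * N * D + 2 * N * k * queries * tokens_per_query

def best_allocation(budget):
    """Grid-search (N, D, k) maximizing pass@k under the joint budget."""
    best = None
    for N in [10**p for p in range(7, 11)]:       # 1e7 .. 1e10 parameters
        for ratio in [20, 100, 500, 2000]:        # tokens per parameter
            D = N * ratio
            for k in [1, 4, 16, 64, 256]:
                if total_flops(N, D, k) <= budget:
                    cand = (pass_at_k(N, D, k), N, D, k)
                    if best is None or cand > best:
                        best = cand
    return best

acc, N, D, k = best_allocation(1e21)
print(f"N={N:.0e} params, D/N={D / N:.0f} tok/param, k={k}, pass@k={acc:.3f}")
```

Even in this toy setting, once inference cost is charged against the same budget, the search tends to favor smaller models trained on far more than 20 tokens per parameter, which is the qualitative shift toward overtraining that the paper quantifies.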

Paper Structure

This paper contains 24 sections, 14 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Our $T^2$ scaling laws combine Chinchilla scaling for pretraining with pass@$k$ modeling for test-time scaling via repeated sampling to obtain optimal pretraining allocations subject to a test-time scaling budget. $T^2$ recommends overtraining compared to Chinchilla.
  • Figure 2: Optimal pretraining forecasts predicted by both $T^2$ approaches, compared to Hoffmann et al. (2022). (Left) Optimal tokens per parameter (including the 20 tokens per parameter rule of thumb used by practitioners). (Middle) Optimal model sizes. (Right) Optimal training set sizes. Both $T^2$ approaches forecast extreme overtraining.
  • Figure 3: $T^2$ scaling across all of our evaluation tasks. Both approaches improve monotonically over Chinchilla scaling, while Chinchilla exhibits non-monotonic scaling in $C_\text{train}$.
  • Figure 4: Extrapolating checkpoints from Porian et al. (2024) to the overtraining regime.
  • Figure 5: $T^2$ overtraining findings survive post-training. The optimal frontier is slightly subdued compared to base models, which is consistent with Springer et al. (2025).
  • ...and 3 more figures