Scaling Laws for Predicting Downstream Performance in LLMs

Yangyi Chen, Binxuan Huang, Yifan Gao, Zhengyang Wang, Jingfeng Yang, Heng Ji

TL;DR

This work addresses predicting the downstream performance of LLMs before full pre-training by introducing FLP, a two-stage scaling framework that first links compute to pre-training loss via a power law, then maps loss to downstream performance with a linear relation. To improve sample efficiency and handle data heterogeneity, the authors extend FLP into FLP-M, which incorporates domain-specific losses from multiple data sources and uses a two-layer neural network to predict downstream outcomes. Empirical results show that FLP predicts the performance of 7B and 13B target models within 5% and 10% relative error, respectively, using sampling LMs up to 3B, outperforming direct FLOPs-to-Performance baselines; FLP-M further improves predictions across mixed data regimes and identifies data-mixing ratios that optimize task performance. The work also includes ablations and analyses of the benefits and limits of domain-specific loss modeling, offering a practical framework for planning data mixtures and compute in large-scale pre-training. Overall, FLP and FLP-M provide principled, data-efficient tools for forecasting downstream capabilities and guiding data composition during LLM pre-training.

Abstract

Precise estimation of downstream performance in large language models (LLMs) prior to training is essential for guiding their development process. Scaling-law analysis uses the statistics of a series of significantly smaller sampling language models (LMs) to predict the performance of the target LLM. For downstream performance prediction, the critical challenge lies in the emergent abilities of LLMs, which occur beyond task-specific computational thresholds. In this work, we focus on the pre-training loss as a more computation-efficient metric for performance estimation. Our two-stage approach, FLP, first estimates a function that maps computational resources (e.g., FLOPs) to the pre-training Loss using a series of fully-converged sampling models, and then maps the pre-training loss to downstream task Performance using the intermediate models with emerged performance. In our experiments, this FLP solution accurately predicts the performance of LLMs with 7B and 13B parameters using a series of sampling LMs up to 3B, achieving error margins of 5% and 10%, respectively, and significantly outperforming the FLOPs-to-Performance approach. Further, we present FLP-M, a fundamental approach for performance prediction that addresses the practical need to integrate datasets from multiple sources during pre-training. FLP-M extends the power-law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources, and employs a two-layer neural network to model the non-linear relationship between multiple domain-specific losses and downstream performance. By utilizing a 3B LLM trained on a specific ratio and a series of smaller sampling LMs, FLP-M can effectively forecast the performance of 3B and 7B LLMs across various data mixtures for most benchmarks within 10% error margins.
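
The two-stage idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the random baseline of 25, the emergence margin of 5, and all numbers are assumptions made up for illustration, and the data are synthetic and noise-free. Stage 1 fits a power law from FLOPs to loss in log-log space; stage 2 fits a linear map from loss to performance using only checkpoints that have emerged; the two fits are then chained to predict a larger target budget.

```python
import numpy as np

def fit_flops_to_loss(flops, loss):
    """Stage 1: fit loss = a * flops**b by least squares in log-log space."""
    b, log_a = np.polyfit(np.log(flops), np.log(loss), deg=1)
    return np.exp(log_a), b

def fit_loss_to_performance(loss, perf):
    """Stage 2: fit perf = w0 + w1 * loss by ordinary least squares."""
    w1, w0 = np.polyfit(loss, perf, deg=1)
    return w0, w1

# Synthetic, noise-free converged sampling models (illustrative values only).
flops = np.array([1e19, 3e19, 1e20, 3e20, 1e21])
loss = 50.0 * flops ** -0.06

# Synthetic intermediate checkpoints: performance stays at the assumed random
# baseline (25) until the loss is low enough, then improves linearly.
random_baseline, margin = 25.0, 5.0  # hypothetical emergence criterion
ckpt_loss = np.array([3.6, 3.3, 2.95, 2.85, 2.75])
ckpt_perf = np.maximum(random_baseline, 120.0 - 30.0 * ckpt_loss)

# Keep only checkpoints whose performance has emerged above the baseline.
emerged = ckpt_perf >= random_baseline + margin

a, b = fit_flops_to_loss(flops, loss)
w0, w1 = fit_loss_to_performance(ckpt_loss[emerged], ckpt_perf[emerged])

# Chain the two fits to predict a larger target compute budget.
target_flops = 1e22
pred_loss = a * target_flops ** b
pred_perf = w0 + w1 * pred_loss
print(f"predicted loss={pred_loss:.3f}, predicted performance={pred_perf:.2f}")
```

Filtering to emerged checkpoints matters in this sketch because points stuck at the random baseline would flatten the linear fit and bias the extrapolation downward.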

Paper Structure

This paper contains 32 sections, 6 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: The performance of sampling LMs with increasing compute. $\times$ represents non-emerged data points, and $\bullet$ indicates emerged data points that surpass a randomness threshold of 5.
  • Figure 2: The downstream performance prediction using FP and FLP fit curves. FLP can better predict the downstream performance of target 7B and 13B LLMs across all evaluation benchmarks.
  • Figure 3: The relative prediction error for the 7B and 13B LLMs. FLP achieves more accurate predictions, with error margins of 5% and 10% across all benchmarks for the two LLMs, respectively.
  • Figure 4: The downstream performance prediction using FLP and FLP-M fit curves. FLP-M can better predict the downstream performance of target LLMs across various data mixing ratios.
  • Figure 5: The relative prediction error of downstream performance prediction using FLP and FLP-M. FLP-M can better predict the performance of target LLMs across various data mixing ratios.
  • ...and 11 more figures