Scaling Laws for Predicting Downstream Performance in LLMs
Yangyi Chen, Binxuan Huang, Yifan Gao, Zhengyang Wang, Jingfeng Yang, Heng Ji
TL;DR
This work addresses predicting the downstream performance of LLMs before full pre-training by introducing FLP, a two-stage scaling framework that first links compute to pre-training loss via a power law, then maps loss to downstream performance with a linear relation. To improve sample efficiency and handle data heterogeneity, the authors extend FLP into FLP-M, which predicts a domain-specific loss for each data source and uses a two-layer neural network to map these losses to downstream outcomes. Empirical results show that FLP predicts the performance of 7B and 13B target models within error margins of 5% and 10%, respectively, using sampling LMs of up to 3B parameters, outperforming direct FLOPs-to-Performance baselines; FLP-M extends these predictions to mixed-data regimes, forecasting 3B and 7B models across various data mixtures within 10% error on most benchmarks, and identifies data-mixing ratios that optimize task performance. The work also includes ablations and analyses demonstrating the benefits and limits of domain-specific loss modeling, offering a practical framework for planning data mixtures and compute in large-scale pre-training. Overall, FLP and FLP-M provide principled, data-efficient tools for forecasting downstream capabilities and guiding data composition during LLM pre-training.
Abstract
Precise estimation of downstream performance in large language models (LLMs) prior to training is essential for guiding their development process. Scaling law analysis uses the statistics of a series of significantly smaller sampling language models (LMs) to predict the performance of the target LLM. For downstream performance prediction, the critical challenge lies in the emergent abilities of LLMs, which appear only beyond task-specific computational thresholds. In this work, we focus on the pre-training loss as a more computation-efficient metric for performance estimation. Our two-stage approach, FLP, first estimates a function that maps computational resources (e.g., FLOPs) to the pre-training Loss using a series of fully converged sampling models, and then maps the pre-training loss to downstream task Performance using the intermediate models with emerged performance. In our experiments, this FLP solution accurately predicts the performance of LLMs with 7B and 13B parameters using a series of sampling LMs up to 3B, achieving error margins of 5% and 10%, respectively, and significantly outperforming the FLOPs-to-Performance approach. Further, we present FLP-M, a fundamental approach for performance prediction that addresses the practical need to integrate datasets from multiple sources during pre-training. FLP-M extends the power-law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources, and employs a two-layer neural network to model the non-linear relationship between multiple domain-specific losses and downstream performance. By utilizing a 3B LLM trained on a specific ratio and a series of smaller sampling LMs, FLP-M can effectively forecast the performance of 3B and 7B LLMs across various data mixtures for most benchmarks within 10% error margins.
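The two-stage FLP idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the power-law form loss = a * FLOPs^(-b), the log-log least-squares fit, the helper names, and all numbers below are assumptions chosen for illustration, and the data points are synthetic rather than results from the paper.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_flops_to_loss(flops, losses):
    """Stage 1 (assumed form): fit loss = a * flops**(-b) as a line in log-log space."""
    log_flops = [math.log(c) for c in flops]
    log_losses = [math.log(l) for l in losses]
    slope, intercept = fit_line(log_flops, log_losses)
    return math.exp(intercept), -slope  # (a, b)


def fit_loss_to_performance(losses, scores):
    """Stage 2: fit performance = w * loss + c on (loss, score) pairs."""
    return fit_line(losses, scores)  # (w, c)


def predict_performance(target_flops, stage1, stage2):
    """Chain both stages: FLOPs -> predicted loss -> predicted performance."""
    a, b = stage1
    w, c = stage2
    predicted_loss = a * target_flops ** (-b)
    return predicted_loss, w * predicted_loss + c


# Synthetic sampling-model data (hypothetical values, not from the paper).
sample_flops = [1e18, 1e19, 1e20, 1e21]
sample_losses = [20.0 * c ** (-0.05) for c in sample_flops]
# In FLP these pairs would come from checkpoints whose task performance
# has already emerged; here they follow a made-up linear relation.
sample_scores = [-0.5 * l + 1.5 for l in sample_losses]

stage1 = fit_flops_to_loss(sample_flops, sample_losses)
stage2 = fit_loss_to_performance(sample_losses, sample_scores)
target_loss, target_score = predict_performance(1e23, stage1, stage2)
print(f"predicted loss={target_loss:.3f}, predicted score={target_score:.3f}")
```

Fitting the power law in log-log space keeps the sketch dependency-free; a nonlinear least-squares fit would be a reasonable alternative when the assumed form includes an irreducible-loss term.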
