Performance Law of Large Language Models

Chuhan Wu, Ruiming Tang

TL;DR

The paper introduces the Performance Law, an empirical, interpretable predictor of MMLU scores for dense and MoE LLMs. It is derived from a small set of hyperparameters ($N$, $h$, $d$, $T$, $S$) and an instability discount that accounts for training precision. The framework extends to MoE models through the activated-parameter count ($A$), an expansion factor ($g$), and a modified depth term ($d'$), with saturation handled via $T' = \min(T, S)$ and post-hoc adjustments that bound predictions. The authors validate the approach on 2024-era models, reporting strong predictive accuracy across 55 models, and extract actionable insights about depth, hidden size versus FFN size, data quality, and the promise and challenges of MoE, including data contamination detection and closed-source inference. They propose diverse applications, from forecasting upscaling potential and guiding architecture search to health monitoring and dense-model expansion planning, all aimed at reducing compute costs and carbon footprint. Overall, the Performance Law offers a practical, quantitative tool for LLM developers to optimize configurations and resource allocation, while highlighting data quality and precision as critical factors in real-world performance.

Abstract

Guided by belief in the scaling law, large language models (LLMs) have achieved impressive performance in recent years. However, the scaling law gives only a qualitative estimate of loss, which is influenced by various factors such as model architecture, data distribution, tokenizer, and computation precision. Thus, estimating the real performance of LLMs under different training settings, rather than their loss, can be quite useful in practical development. In this article, we present an empirical equation named "Performance Law" to directly predict the MMLU score of an LLM, a widely used metric for the general capability of LLMs in real-world conversations and applications. Based on only a few key hyperparameters of the LLM architecture and the size of the training data, we obtain quite accurate MMLU predictions for various LLMs of diverse sizes and architectures developed by different organizations in different years. The Performance Law can be used to guide the choice of LLM architecture and the effective allocation of computational resources without extensive experiments.

Paper Structure (17 sections, 7 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: The predicted and real MMLU scores of different models.
  • Figure 2: The predicted MMLU scores under different model depths and precision degrees.
  • Figure 3: Predictions under different data sizes and $\gamma$ values.
  • Figure 4: The predicted and real metrics of two models.