Performance Law of Large Language Models

Chuhan Wu; Ruiming Tang

TL;DR

The paper introduces the Performance Law, an empirical, interpretable equation that predicts the MMLU score of dense and MoE LLMs from a small set of hyperparameters ($N$, $h$, $d$, $T$, $S$) plus an instability discount that accounts for training precision. For MoE models, the framework uses the activated-parameter count ($A$), an expansion factor ($g$), and a modified depth term ($d'$). Saturation is modeled with $T' = \min(T, S)$, and post-hoc adjustments keep predictions bounded. The authors validate the approach on 2024-era models, reporting strong predictive accuracy across 55 models, and extract actionable insights about depth, hidden size versus FFN size, data quality, and the promise and challenges of MoE. Proposed applications include forecasting upscaling potential, guiding architecture search, monitoring model health, planning dense-model expansion, detecting data contamination, and inferring the configuration of closed-source models, all aimed at reducing compute cost and carbon footprint. Overall, the Performance Law offers a practical, quantitative tool for LLM developers to optimize configurations and resource allocation, while highlighting data quality and precision as critical factors in real-world performance.
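To make the saturation and bounding steps in the TL;DR concrete, here is a minimal Python sketch. Only the clip $T' = \min(T, S)$ comes from the summary. Everything else is an assumption made for illustration: the log-linear shape of `predict_score`, the `weights` and `bias` placeholders, the 0 to 100 bounds, and the reading of $N$, $h$, $d$, $T$, $S$ as layer count, hidden size, FFN size, training tokens, and model size (the summary does not define them or their units). It is not the paper's fitted equation.

```python
import math


def saturated_tokens(T: float, S: float) -> float:
    """Saturation clip stated in the TL;DR: T' = min(T, S)."""
    return min(T, S)


def clip_score(raw: float, lo: float = 0.0, hi: float = 100.0) -> float:
    """Bound a raw prediction to a valid score range.

    The 0..100 range is an assumption; the summary only says that
    post-hoc adjustments bound the predictions.
    """
    return max(lo, min(hi, raw))


def predict_score(N, h, d, T, S, weights, bias):
    """Hypothetical log-linear predictor (illustrative form only).

    `weights` = (w_N, w_h, w_d, w_T) and `bias` are placeholders that
    would have to be fitted; no values from the paper are used here.
    """
    w_N, w_h, w_d, w_T = weights
    T_eff = saturated_tokens(T, S)  # extra data beyond S adds nothing
    raw = (
        w_N * math.log(N)
        + w_h * math.log(h)
        + w_d * math.log(d)
        + w_T * math.log(T_eff)
        + bias
    )
    return clip_score(raw)
```

With this shape, raising `T` past `S` leaves the prediction unchanged, which is the saturation behavior the TL;DR describes.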

Abstract

Guided by belief in the scaling law, large language models (LLMs) have achieved impressive performance in recent years. However, the scaling law gives only a qualitative estimation of loss, which is influenced by various factors such as model architecture, data distribution, tokenizer, and computation precision. Estimating the real performance of LLMs under different training settings, rather than their loss, can therefore be quite useful in practical development. In this article, we present an empirical equation named "Performance Law" to directly predict the MMLU score of an LLM, a widely used metric of the general capability of LLMs in real-world conversations and applications. Based on only a few key hyperparameters of the LLM architecture and the size of the training data, we obtain quite accurate MMLU predictions for LLMs of diverse sizes and architectures developed by different organizations in different years. The Performance Law can be used to guide the choice of LLM architecture and the effective allocation of computational resources without extensive experiments.

Paper Structure (17 sections, 7 equations, 4 figures, 1 table)


Introduction
Performance Law
Formulation of Performance Law
Examples of Performance Prediction
Results and Insights
Applications and Implications
Predict the Upscaling Potential of LLMs
Design Proper Model Architectures
Tracking the Health Status of Models
Planning Dense Model Expansion
Check Implicit Data Contamination
Infer the Model Structure and Data Size of Closed-Source Models
Discussions
Reasons Behind Prediction Errors
Are We Really Making Much Progress in These Years?
...and 2 more sections

Figures (4)

Figure 1: The prediction and real MMLU scores of different models.
Figure 2: The predicted MMLU scores under different model depths and precision degrees.
Figure 3: Predictions under different data sizes and $\gamma$ values.
Figure 4: The prediction and real metrics of two models.
