
Is there "Secret Sauce" in Large Language Model Development?

Matthias Mertens, Natalia Fischl-Lanzoni, Neil Thompson

TL;DR

The paper investigates whether frontier LLM progress is driven by scale or by proprietary techniques, using a dataset of MMLU-Pro scores and training compute for 809 models released between 2022 and 2025. It decomposes observed performance differences into four components: scaling effects from compute, shared algorithmic progress, developer-specific efficiency (the "secret sauce"), and model-specific factors. The decomposition comes from a regression on logit-transformed scores. At the frontier, roughly 80-90% of performance differences are explained by higher training compute, but algorithmic progress and developer-specific efficiency still contribute meaningfully, and compute efficiency is widely dispersed both across and within firms. Algorithmic progress yields large efficiency gains that let much smaller models reach fixed scores (compute reductions of up to about $8{,}000\times$ when smaller developers are included), suggesting that efficiency improvements can democratize capabilities while reducing costs, though diffusion may create rents for firms with proprietary techniques. Overall, the results imply that sustained AI leadership depends on access to expanding compute, while efficiency gains diffuse differently across models and firms, influencing future frontier progress and price dynamics.
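As a rough illustration of the kind of regression described above, the following sketch fits logit-transformed scores on log compute with release-year and developer dummies, using synthetic data. Everything in it is an assumption made for illustration: the function names, the `dev_a`/`dev_b`/`dev_c` labels, the true coefficients, and the use of plain OLS with yearly dummies. It is not the authors' specification or data.

```python
import numpy as np

def logit(p):
    """Map a score in (0, 1) to the real line."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))

def fit_scaling_regression(score, log10_flops, year, developer):
    """OLS of logit(score) on log10 compute plus year and developer dummies.

    The first (sorted) year and developer are the omitted baselines, so the
    returned fixed effects are measured relative to them.
    """
    y = logit(score)
    years = sorted(set(year))
    devs = sorted(set(developer))
    cols = [np.ones(len(y)), np.asarray(log10_flops, dtype=float)]
    cols += [(np.asarray(year) == t).astype(float) for t in years[1:]]
    cols += [(np.asarray(developer) == d).astype(float) for d in devs[1:]]
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    n_year = len(years) - 1
    beta = coef[1]
    year_fe = dict(zip(years[1:], coef[2:2 + n_year]))
    dev_fe = dict(zip(devs[1:], coef[2 + n_year:]))
    return beta, year_fe, dev_fe

# Synthetic data with known (hypothetical) parameters.
rng = np.random.default_rng(0)
n = 400
log10_flops = rng.uniform(21, 26, n)
year = rng.integers(2022, 2026, n).tolist()
developer = rng.choice(["dev_a", "dev_b", "dev_c"], n).tolist()
true_beta = 0.8
true_year = {2022: 0.0, 2023: 0.3, 2024: 0.6, 2025: 0.9}
true_dev = {"dev_a": 0.0, "dev_b": 0.4, "dev_c": -0.2}
z = (-19.0 + true_beta * log10_flops
     + np.array([true_year[t] for t in year])
     + np.array([true_dev[d] for d in developer])
     + rng.normal(0, 0.1, n))          # model-specific noise
score = 1.0 / (1.0 + np.exp(-z))       # inverse logit

beta, year_fe, dev_fe = fit_scaling_regression(score, log10_flops, year, developer)
print("compute slope:", beta)
print("year effects:", year_fe)
print("developer effects:", dev_fe)
```

On this synthetic sample the estimates should land close to the values used to generate the data. The four components in the text map onto the slope on compute (scaling), the year dummies (shared algorithmic progress), the developer dummies (developer-specific efficiency), and the residual (model-specific factors).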

Abstract

Do leading LLM developers possess a proprietary "secret sauce", or is LLM performance driven by scaling up compute? Using training and benchmark data for 809 models released between 2022 and 2025, we estimate scaling-law regressions with release-date and developer fixed effects. We find clear evidence of developer-specific efficiency advantages, but their importance depends on where models lie in the performance distribution. At the frontier, 80-90% of performance differences are explained by higher training compute, implying that scale, not proprietary technology, drives frontier advances. Away from the frontier, however, proprietary techniques and shared algorithmic progress substantially reduce the compute required to reach fixed capability thresholds. Some companies can systematically produce smaller models more efficiently. Strikingly, we also find substantial variation in model efficiency within companies: a single firm can train two models whose compute efficiency differs by more than 40x. We also discuss the implications for AI leadership and capability diffusion.
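One way to read a statement such as "more than 40x compute efficiency difference" is as a compute-equivalent gap implied by a scaling-law regression. The derivation below is a sketch under an assumed specification. The symbols $\alpha$, $\beta$, $\gamma$, $\delta$, $\varepsilon$ and the base-10 logarithm are notation chosen here, not taken from the paper, whose exact equations are not reproduced in this summary.

```latex
% Assumed specification: score s_i, training compute C_i,
% release-date effect \gamma_{t(i)}, developer effect \delta_{d(i)},
% model-specific residual \varepsilon_i.
\operatorname{logit}(s_i)
  = \alpha + \beta \log_{10} C_i + \gamma_{t(i)} + \delta_{d(i)} + \varepsilon_i

% Collect the non-compute terms into an efficiency term:
e_i = \gamma_{t(i)} + \delta_{d(i)} + \varepsilon_i

% If models i and j reach the same score, then
\beta \log_{10} C_i + e_i = \beta \log_{10} C_j + e_j
\quad\Longrightarrow\quad
\frac{C_j}{C_i} = 10^{(e_i - e_j)/\beta}
```

Under this reading, two models from the same firm and release date share $\gamma$ and $\delta$, so their gap comes entirely from the residuals. For a purely hypothetical slope of $\beta = 0.8$, a 40x compute-equivalent gap would correspond to $e_i - e_j = 0.8 \cdot \log_{10} 40 \approx 1.28$ on the logit scale.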

Paper Structure (32 sections, 4 equations, 20 figures, 3 tables)


Figures (20)

  • Figure 1: Shapley Variance Decomposition, Different Samples
  • Figure 2: Main Results
  • Figure 3: Contributions to Top Model Over Time
  • Figure 4: Sources of Performance Growth: Frontier Models and Smaller, Efficient Models
  • Figure C.1: MMLU-Pro Score vs Log(FLOPs) Data Visualization with a Logistic Curve Fit
  • ...and 15 more figures