When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

Mubashara Akhtar, Anka Reuel, Prajna Soni, Sanchit Ahuja, Pawan Sasanka Ammanamanchi, Ruchit Rawal, Vilém Zouhar, Srishti Yadav, Chenxi Whitehouse, Dayeon Ki, Jennifer Mickel, Leshem Choshen, Marek Šuppa, Jan Batzner, Jenny Chim, Jeba Sania, Yanan Long, Hossein A. Rahmani, Christina Knight, Yiyang Nan, Jyoutir Raj, Yu Fan, Shubham Singh, Subramanyam Sahoo, Eliya Habba, Usman Gohar, Siddhesh Pawar, Robert Scholz, Arjun Subramonian, Jingwei Ni, Mykel Kochenderfer, Sanmi Koyejo, Mrinmaya Sachan, Stella Biderman, Zeerak Talat, Avijit Ghosh, Irene Solaiman

TL;DR

This study analyzes benchmark saturation across 60 Large Language Model benchmarks selected from technical reports by major model developers. To identify factors driving saturation, it characterizes the benchmarks along 14 properties spanning task design, data construction, and evaluation format.

Abstract

Artificial Intelligence (AI) benchmarks play a central role in measuring progress in model development and guiding deployment decisions. However, many benchmarks quickly become saturated, meaning that they can no longer differentiate between the best-performing models, diminishing their long-term value. In this study, we analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from technical reports by major model developers. To identify factors driving saturation, we characterize benchmarks along 14 properties spanning task design, data construction, and evaluation format. We test five hypotheses examining how each property contributes to saturation rates. Our analysis reveals that nearly half of the benchmarks exhibit saturation, with rates increasing as benchmarks age. Notably, hiding test data (i.e., public vs. private) shows no protective effect, while expert-curated benchmarks resist saturation better than crowdsourced ones. Our findings highlight which design choices extend benchmark longevity and inform strategies for more durable evaluation.

Paper Structure (9 sections, 2 figures, 4 tables)

Figures (2)

  • Figure 4: Forest plot of posterior regression coefficients from the joint interaction model predicting benchmark saturation. Points denote posterior means, inner line segments show 50% highest posterior density intervals, and outer segments indicate 95% credible intervals. Benchmark age and test set size exhibit the most consistent effects on saturation, while task format, literal diversity (templating), and their interactions show no strong effects after controlling for confounders.
  • Figure 5: Posterior distribution of the AUROC for the interaction model predicting benchmark saturation. The distribution is tightly concentrated near high values (median approx. 0.98), indicating that the model distinguishes saturated from non-saturated benchmarks across posterior samples.
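As a rough illustration of the quantity Figure 5 summarizes, the sketch below computes one AUROC value per posterior draw and then takes the median across draws. It is not the authors' code: the function names, the toy labels, and the three hand-written "posterior draws" of predicted saturation probabilities are all hypothetical, and the pairwise AUROC formula is a standard definition assumed here rather than taken from the paper.

```python
from statistics import median


def auroc(labels, scores):
    """AUROC as the probability that a randomly chosen positive (saturated)
    item scores higher than a randomly chosen negative one; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def auroc_posterior(labels, draws):
    """One AUROC per posterior draw of predicted probabilities."""
    return [auroc(labels, scores) for scores in draws]


# Hypothetical toy data: 1 = saturated benchmark, 0 = not saturated.
labels = [1, 1, 0, 0, 1, 0]

# Hypothetical posterior draws: each row holds the predicted saturation
# probability for every benchmark under one sampled parameter vector.
draws = [
    [0.9, 0.8, 0.2, 0.1, 0.7, 0.3],
    [0.9, 0.4, 0.5, 0.1, 0.7, 0.3],
    [0.6, 0.8, 0.2, 0.3, 0.9, 0.1],
]

aurocs = auroc_posterior(labels, draws)
print("per-draw AUROC:", aurocs)
print("posterior median AUROC:", median(aurocs))
```

With a real model, the rows of `draws` would come from the fitted posterior rather than being typed by hand, and the resulting list of AUROC values is what a plot like Figure 5 would display as a distribution.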