Fantastic Bugs and Where to Find Them in AI Benchmarks

Sang Truong, Yuheng Tu, Michael Hardy, Anka Reuel, Zeyu Tang, Jirayu Burapacheep, Jonathan Perera, Chibuike Uwakwe, Ben Domingue, Nick Haber, Sanmi Koyejo

TL;DR

This work introduces a framework for systematic benchmark revision that uses statistical analysis of response patterns to flag potentially invalid questions for expert review, and adds an LLM-judge first pass over the flagged questions to further reduce human effort.

Abstract

Benchmarks are pivotal in driving AI progress, yet invalid benchmark questions frequently undermine their reliability. Manually identifying and correcting errors among thousands of benchmark questions is not only infeasible but also a critical bottleneck for reliable evaluation. In this work, we introduce a framework for systematic benchmark revision that leverages statistical analysis of response patterns to flag potentially invalid questions for further expert review. Our approach builds on a core assumption commonly used in AI evaluations: that the mean score sufficiently summarizes model performance. This implies a unidimensional latent construct underlying the measurement experiment, which in turn yields an expected range for each of several per-item statistics. When the empirical estimate of such a statistic falls outside its expected range for an item, the item is more likely to be problematic. Across nine widely used benchmarks, our method guides expert review to identify problematic questions with up to 84% precision. In addition, we introduce an LLM-judge first pass to review questions, further reducing human effort. Together, these components provide an efficient and scalable framework for systematic benchmark revision.

Paper Structure

This paper contains 20 sections, 4 theorems, 5 equations, 5 figures, 2 tables.

Key Result

Lemma 1

If the family $\{p(X\mid\theta_i): \theta_i\in\Theta\}$ admits the sum score as a sufficient statistic for $\theta_i$, then the latent structure is unidimensional.
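
As a worked illustration of how the sum score and a single latent trait fit together, consider the Rasch model named in Theorem 1. The notation here is assumed for illustration and is not taken from the paper: $x_{ij}\in\{0,1\}$ is the response of model $i$ to item $j$, $b_j$ is an item difficulty, $J$ is the number of items, and $s_i=\sum_{j} x_{ij}$ is the sum score. Assuming responses are conditionally independent given a scalar $\theta_i$, the likelihood factorizes as

$$
p(x_i \mid \theta_i)
= \prod_{j=1}^{J} \frac{\exp\big(x_{ij}(\theta_i - b_j)\big)}{1+\exp(\theta_i - b_j)}
= \underbrace{\frac{\exp(\theta_i s_i)}{\prod_{j=1}^{J}\big(1+\exp(\theta_i - b_j)\big)}}_{g(s_i,\,\theta_i)}
\cdot
\underbrace{\exp\Big(-\sum_{j=1}^{J} x_{ij} b_j\Big)}_{h(x_i)}
$$

Since $\theta_i$ enters only through $g(s_i,\theta_i)$, the Fisher-Neyman factorization criterion gives that $s_i$ is sufficient for $\theta_i$ under this model. This sketch shows only the direction from a unidimensional Rasch model to sum-score sufficiency; Lemma 1 concerns the reverse implication.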

Figures (5)

  • Figure 1: Left: Sensitivity curves on GSM8K for our three measurement-theoretic methods, two baselines, and four ensemble methods: Gaussian Rank Mean, OR Vote, AND Vote, and Majority Vote. Our methods significantly outperform the baselines. No single method uncovers all invalid questions, and each method flags different sets of questions. Right: Precision@50 across the nine benchmarks reviewed by human experts, where questions are examined in the order of the anomaly scores produced by our method. The number of truly invalid questions among the 50 inspected is shown to the right of each bar (2% corresponds to one question). Expert review confirms that up to 84% of the flagged questions exhibit substantive flaws.
  • Figure 2: (a) Precision@50 as a function of the number of LLMs on GSM8K, repeated over 10 random seeds; error bars denote one standard deviation. (b) Precision@50 as a function of the number of organizations, repeated over 10 random seeds; error bars denote one standard deviation. (c) Precision@50 versus model size cutoff. (d) Precision@50 versus release date cutoff. The performance of our methods increases as the number and diversity of LLMs increase.
  • Figure 3: Procedure of the LLM-judge first pass.
  • Figure 4: Each row is a benchmark, each column is an LLM. The blue entry indicates that the LLM is evaluated in the benchmark.
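
The Precision@50 metric used in the figure captions above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `precision_at_k`, its arguments, and the convention that a higher anomaly score means "more suspicious" are all assumptions.

```python
def precision_at_k(anomaly_scores, is_invalid, k=50):
    """Fraction of the k highest-scoring questions that experts judged invalid.

    anomaly_scores: one score per question (assumed: higher = more anomalous).
    is_invalid:     one expert label per question (truthy = confirmed invalid).
    """
    if len(anomaly_scores) != len(is_invalid):
        raise ValueError("scores and labels must have the same length")
    # Rank question indices from most to least anomalous.
    order = sorted(range(len(anomaly_scores)),
                   key=lambda i: anomaly_scores[i],
                   reverse=True)
    top = order[:k]
    if not top:
        return 0.0
    return sum(bool(is_invalid[i]) for i in top) / len(top)
```

With `k=50`, a single confirmed invalid question among the inspected ones gives 1/50, which matches the caption's note that 2% corresponds to one question.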

Theorems & Definitions (8)

  • Lemma 1: Unidimensionality
  • proof
  • Theorem 1: Rasch Model (Theorem 2.1 in Fischer, 1995)
  • proof
  • Corollary 1: Positivity of Tetrachoric Correlation under Unidimensionality
  • proof
  • Corollary 2: Positivity of Item-total Correlation under Unidimensionality
  • proof
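
Corollary 2 states that item-total correlations are positive under unidimensionality, so an item whose empirical correlation is negative is a natural candidate for review. The sketch below illustrates that idea under stated assumptions; it is not the paper's implementation. The function names, the use of the rest score (the total excluding the item itself, a common variant of the item-total correlation), and the threshold of 0 are all choices made for this example.

```python
import numpy as np

def item_rest_correlations(responses):
    """Correlate each item with the sum score over all *other* items.

    responses: array-like of shape (n_models, n_items) with 0/1 entries.
    Returns an array of length n_items; NaN where a correlation is undefined
    (for example, when every model answers the item the same way).
    """
    X = np.asarray(responses, dtype=float)
    total = X.sum(axis=1)
    corrs = np.full(X.shape[1], np.nan)
    for j in range(X.shape[1]):
        rest = total - X[:, j]  # exclude item j from its own total
        if X[:, j].std() == 0 or rest.std() == 0:
            continue  # no variance, so the correlation is undefined
        corrs[j] = np.corrcoef(X[:, j], rest)[0, 1]
    return corrs

def flag_items(responses, threshold=0.0):
    """Indices of items whose item-rest correlation falls below `threshold`."""
    corrs = item_rest_correlations(responses)
    return [j for j, r in enumerate(corrs) if not np.isnan(r) and r < threshold]

# Toy data: rows are models ordered from weakest to strongest.
# Items 0-3 get easier to pass as ability rises; item 4 runs the other way.
toy = [
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
]
flagged = flag_items(toy)
```

In the toy matrix only the last item is answered correctly by weak models and incorrectly by strong ones, so it is the one that gets flagged. A flag is only a prompt for expert review, not proof that the question is invalid.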