Benchmark Illusion: Disagreement among LLMs and Its Scientific Consequences

Eddie Yang, Dashun Wang

TL;DR

This work reveals a benchmark illusion: even when multiple LLMs achieve similar accuracy on standard reasoning benchmarks, they disagree on many items in ways that affect scientific inference. Using MMLU-Pro and GPQA, the authors show substantial item-level divergence across high-performing models, which translates into biased or even reversed downstream conclusions in simulation and two empirical case studies in education and political science. Through a measurement-error framework, a controlled simulation, and reanalyses of real studies, they demonstrate that model identity and error structure are critical design choices for inference, not mere technical details. The paper argues for a science-oriented evaluation regime that prioritizes agreement, stability, and calibrated uncertainty, and it suggests practical workflows that combine multiple models and calibrations to safeguard reproducibility in LLM-assisted research.

Abstract

Benchmarks underpin how progress in large language models (LLMs) is measured and trusted. Yet our analyses reveal that apparent convergence in benchmark accuracy can conceal deep epistemic divergence. Using two major reasoning benchmarks - MMLU-Pro and GPQA - we show that LLMs achieving comparable accuracy still disagree on 16-66% of items, and 16-38% among top-performing frontier models. These discrepancies suggest distinct error profiles for different LLMs. When such models are used for scientific data annotation and inference, their hidden disagreements propagate into research results: in re-analyses of published studies in education and political science, switching the annotation model can change estimated treatment effects by more than 80% and, in some cases, reverse their sign. Together, these findings illustrate a benchmark illusion, where equal accuracy may conceal disagreement, with model choice becoming a hidden yet consequential variable for scientific reproducibility.
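
The gap between equal accuracy and item-level agreement can be illustrated with a minimal sketch. Everything in it is hypothetical: the three model names, the ten-item answer key, and the predictions are toy values chosen for illustration, not data from the paper. Disagreement is assumed here to mean the share of items on which two models give different answers, which may differ from the paper's exact definition.

```python
from itertools import combinations


def accuracy(preds, gold):
    """Share of items where the prediction matches the answer key."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def disagreement(preds_a, preds_b):
    """Share of items where two models give different answers."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def pairwise_disagreement(model_preds):
    """Disagreement rate for every unordered pair of models."""
    return {
        (m1, m2): disagreement(model_preds[m1], model_preds[m2])
        for m1, m2 in combinations(sorted(model_preds), 2)
    }


# Hypothetical answer key and predictions (toy data, not from the paper).
gold = ["A", "B", "C", "D", "A", "B", "C", "D", "A", "B"]
model_preds = {
    # Each model misses exactly two items, but not the same two.
    "model_x": ["B", "C", "C", "D", "A", "B", "C", "D", "A", "B"],
    "model_y": ["A", "B", "C", "D", "A", "B", "C", "D", "B", "C"],
    "model_z": ["C", "B", "C", "D", "A", "B", "C", "D", "A", "A"],
}

accuracies = {name: accuracy(p, gold) for name, p in model_preds.items()}
rates = pairwise_disagreement(model_preds)

for name, acc in accuracies.items():
    print(f"{name}: accuracy = {acc:.2f}")
for pair, rate in rates.items():
    print(f"{pair[0]} vs {pair[1]}: disagreement = {rate:.2f}")
```

In this toy setup all three models score 80%, yet each pair disagrees on 30-40% of items because their errors fall on different questions. Any downstream annotation would therefore differ on those items depending on which model is chosen.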

Paper Structure

This paper contains 13 sections, 2 equations, and 9 figures.

Figures (9)

  • Figure 1: MMLU-Pro -- Pairwise Model Proportion of Disagreement
  • Figure 2: GPQA -- Pairwise Model Proportion of Disagreement
  • Figure 3: Simulation result
  • Figure 4: Reanalysis of kim2021improving
  • Figure 5: Reanalysis of rozenas2019autocrats
  • ...and 4 more figures