Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores
Robert E. Blackwell, Jon Barry, Anthony G. Cohn
TL;DR
This paper tackles the reproducibility crisis in large language model (LLM) evaluation by quantifying uncertainty in benchmark scores. It introduces a cost-efficient approach that uses prediction intervals to estimate how many experimental repeats are needed for stable results, demonstrated on two cardinal-direction reasoning benchmarks across six LLMs. Key findings show that setting temperature to 0 and fixing the seed significantly reduce score variability and that, in many cases, only a few repeats are required to achieve tight prediction intervals, although API hosting can still introduce differences in mean score. The work provides practical guidelines for reporting benchmark results (as $\bar{x} \pm$ PI) and stresses thorough documentation of experimental conditions to enable reproducible cross-model comparisons.
Abstract
Large language models (LLMs) are stochastic, and not all models give deterministic answers, even when setting temperature to zero with a fixed random seed. However, few benchmark studies attempt to quantify uncertainty, partly due to the time and cost of repeated experiments. We use benchmarks designed for testing LLMs' capacity to reason about cardinal directions to explore the impact of experimental repeats on mean score and prediction interval. We suggest a simple method for cost-effectively quantifying the uncertainty of a benchmark score and make recommendations concerning reproducible LLM evaluation.
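To make the idea of reporting a mean score with a prediction interval concrete, here is a minimal sketch. The abstract does not give the paper's formula, so this assumes the textbook 95% prediction interval for one further repeat under approximately normal scores, $\bar{x} \pm t_{0.975,\,n-1}\, s \sqrt{1 + 1/n}$. The function name `prediction_interval` and the example scores are hypothetical, and the authors' method may differ.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (n - 1).
# A small lookup table keeps this sketch free of third-party dependencies.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def prediction_interval(scores):
    """Return (mean, half_width) of a 95% prediction interval.

    Assumption: this is the standard normal-theory interval for a single
    future repeat, not necessarily the exact procedure used in the paper.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two repeats to estimate spread")
    if n - 1 not in T_CRIT_95:
        raise ValueError("lookup table only covers 2 to 11 repeats")
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation (n - 1 divisor)
    half_width = T_CRIT_95[n - 1] * sd * math.sqrt(1 + 1 / n)
    return mean, half_width


# Hypothetical accuracy scores from five repeats of the same benchmark run.
mean, pi = prediction_interval([0.80, 0.82, 0.78, 0.80, 0.80])
print(f"{mean:.3f} ± {pi:.3f}")
```

If every repeat gives the same score, the half-width is zero. That is the fully deterministic case the abstract notes not every model reaches, even at temperature zero with a fixed seed.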
