Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores

Robert E. Blackwell, Jon Barry, Anthony G. Cohn

TL;DR

This paper tackles the reproducibility crisis in large language model (LLM) evaluation by quantifying uncertainty in benchmark scores. It introduces a cost-efficient approach using prediction intervals to measure how many experimental repeats are needed for stable results, demonstrated on two cardinal-direction reasoning benchmarks across six LLMs. Key findings show that forcing temperature to 0 and fixing the seed significantly reduces score variability and that, in many cases, only a few repeats are required to achieve tight prediction intervals, though API hosting can still introduce mean differences. The work provides practical guidelines for reporting benchmark results (as $\bar{x} \pm$ PI) and stresses thorough documentation of experimental conditions to enable reproducible cross-model comparisons.

Abstract

Large language models (LLMs) are stochastic, and not all models give deterministic answers, even when setting temperature to zero with a fixed random seed. However, few benchmark studies attempt to quantify uncertainty, partly due to the time and cost of repeated experiments. We use benchmarks designed for testing LLMs' capacity to reason about cardinal directions to explore the impact of experimental repeats on mean score and prediction interval. We suggest a simple method for cost-effectively quantifying the uncertainty of a benchmark score and make recommendations concerning reproducible LLM evaluation.

Paper Structure

This paper contains 9 sections, 3 equations, and 4 figures.

Figures (4)

  • Figure 1: Prediction interval by repeat for each model tested with Small and Large benchmarks. The red dashed line indicates the repeat at which the prediction interval width falls below 0.01. For Claude-3.5S and Gemini-1.5P applied to Large, the total number of repeats tested is fewer than 30 owing to practical constraints.
  • Figure 2: Mean score by model for the Small benchmark (top) and the Large benchmark (bottom). Models use default settings on the left and $temperature$ = 0 with a fixed $seed$ on the right. The text on the bars shows mean score ($\bar{x}$), standard deviation ($\sigma$), worst score ($\downarrow$), best score ($\uparrow$), and the number of experimental repeats ($n$). Green bars indicate range $<$ 0.01. *Note that the $seed$ parameter cannot be set for Claude-3.5S.
  • Figure 3: Histogram of mean score for the Small benchmark for GPT-3.5T using the Azure OpenAI API (left) and the OpenAI API (right). The two distributions are statistically significantly different ($t$ = 2.51, $p$ = 0.013, $n$ = 90), indicating possible differences in the hosting of the model.
  • Figure 4: Histogram of mean score for each model tested with Small and Large benchmarks. In most cases the number of repeats is $n$ = 30, but in some cases $n$ is smaller owing to practical constraints; in these cases $n$ is specified in the top right of the plot.
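
The idea behind Figure 1, tracking the prediction interval as repeats accumulate and stopping once it is narrower than 0.01, can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes the textbook 95% prediction interval for one future observation from approximately normal data, $\bar{x} \pm t_{0.975,\,n-1}\, s \sqrt{1 + 1/n}$, which may differ from the formula used in the paper. It also assumes that "width" means the full interval width (twice the half-width). The function names and the example scores are hypothetical.

```python
import math
import statistics

# Two-sided 95% Student-t critical values for 1..29 degrees of freedom,
# enough for up to 30 repeats (standard table values, rounded to 3 d.p.).
T_975 = [12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228,
         2.201, 2.179, 2.160, 2.145, 2.131, 2.120, 2.110, 2.101, 2.093, 2.086,
         2.080, 2.074, 2.069, 2.064, 2.060, 2.056, 2.052, 2.048, 2.045]


def prediction_interval(scores):
    """Return (mean, half_width) of an assumed 95% prediction interval.

    `scores` holds one benchmark score per experimental repeat.
    """
    n = len(scores)
    if not 2 <= n <= len(T_975) + 1:
        raise ValueError("need between 2 and 30 repeats")
    mean = statistics.mean(scores)
    s = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)
    half_width = T_975[n - 2] * s * math.sqrt(1 + 1 / n)
    return mean, half_width


def repeats_needed(scores, max_width=0.01):
    """Return the first repeat count whose full interval width is below max_width.

    Returns None if the threshold is never reached with the available repeats.
    """
    for n in range(2, len(scores) + 1):
        _, half_width = prediction_interval(scores[:n])
        if 2 * half_width < max_width:
            return n
    return None


if __name__ == "__main__":
    # Made-up scores for illustration only; these are not results from the paper.
    example_scores = [0.812, 0.809, 0.811, 0.810, 0.812, 0.811]
    mean, half_width = prediction_interval(example_scores)
    print(f"score = {mean:.3f} ± {half_width:.3f} (95% PI, n = {len(example_scores)})")
    print("repeats needed for width < 0.01:", repeats_needed(example_scores))
```

Because the interval is recomputed on each prefix of the scores, the repeats can be run one at a time and stopped as soon as the threshold is met, which is where the cost saving comes from. A model that returns identical scores on every repeat has zero standard deviation and meets the threshold after two repeats.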