Position: Don't Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints

Sam Bowyer, Laurence Aitchison, Desi R. Ivanova

TL;DR

The paper argues that CLT-based confidence intervals are unreliable for LLM evals when sample sizes are small or the data have structure (e.g., clustered questions or otherwise non-IID tasks). It demonstrates failures in four settings (IID questions, clustered questions, independent model comparisons, and paired model comparisons) and for metrics that are not simple averages, and shows that Bayesian credible intervals and certain frequentist alternatives, such as Wilson intervals, give valid, well-calibrated uncertainty estimates even at small $N$. It provides practical guidance and a Python library (bayes_evals) implementing the Bayesian methods, and concludes that Bayesian or robust frequentist uncertainty quantification should become standard practice in LLM evaluations. The work emphasizes more reliable and fairer model comparisons and deployment decisions, especially where data are costly or highly specialized.

Abstract

Rigorous statistical evaluations of large language models (LLMs), including valid error bars and significance testing, are essential for meaningful and reliable performance assessment. Currently, when such statistical measures are reported, they typically rely on the Central Limit Theorem (CLT). In this position paper, we argue that while CLT-based methods for uncertainty quantification are appropriate when benchmarks consist of thousands of examples, they fail to provide adequate uncertainty estimates for LLM evaluations that rely on smaller, highly specialized benchmarks. In these small-data settings, we demonstrate that CLT-based methods perform very poorly, usually dramatically underestimating uncertainty (i.e. producing error bars that are too small). We give recommendations for alternative frequentist and Bayesian methods that are both easy to implement and more appropriate in these increasingly common scenarios. We provide a simple Python library for these Bayesian methods at https://github.com/sambowyer/bayes_evals .
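
To make the contrast concrete, the following is a minimal sketch (not the authors' code and not the bayes_evals API) of three 95% intervals for accuracy on a small benchmark: the CLT (Wald) interval, the Wilson score interval, and an equal-tailed Bayesian credible interval. The uniform Beta(1, 1) prior, the Monte Carlo quantile estimate, and all function names are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist


def clt_interval(k, n, level=0.95):
    """CLT (Wald) interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_interval(k, n, level=0.95):
    """Wilson score interval; mathematically it always lies inside [0, 1]."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    # The clamp only guards against floating-point round-off at the edges.
    return max(0.0, centre - half), min(1.0, centre + half)


def beta_credible_interval(k, n, level=0.95, draws=20_000, seed=0):
    """Equal-tailed credible interval under an assumed Beta(1, 1) prior.

    The posterior is Beta(1 + k, 1 + n - k); its quantiles are estimated
    from sorted Monte Carlo draws so that only the standard library is needed.
    """
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(1 + k, 1 + n - k) for _ in range(draws))
    lo_idx = int((1 - level) / 2 * draws)
    hi_idx = int((1 + level) / 2 * draws) - 1
    return samples[lo_idx], samples[hi_idx]


if __name__ == "__main__":
    n = 20  # a small benchmark, hypothetical counts below
    for k in (20, 19):
        print(f"{k}/{n} correct")
        for name, fn in [("CLT", clt_interval),
                         ("Wilson", wilson_interval),
                         ("Bayes", beta_credible_interval)]:
            lo, hi = fn(k, n)
            print(f"  {name:<7} [{lo:.3f}, {hi:.3f}]")
```

With 20 of 20 correct, the CLT interval has zero width because the estimated standard error is zero; with 19 of 20 correct, its upper end exceeds 1. The other two intervals remain inside $[0,1]$ with nonzero width in both cases.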

Paper Structure

This paper contains 35 sections, 55 equations, 39 figures, 1 table.

Figures (39)

  • Figure 1: Error bars on LangChain tool-use benchmark. 95% intervals for model accuracy on $N{=}20$ questions. The CLT produces invalid intervals, extending beyond $[0,1]$ or collapsing to zero, highlighting its unreliability in practical settings. The alternative frequentist and Bayesian methods we advocate yield valid and well-calibrated intervals even in this small-data regime. See the appendix on real data (app:real_data) for more results.
  • Figure 2: IID question setting. Coverage vs. confidence level (top) and vs. interval width (bottom) for various interval-calculation methods on the value of $\theta$. While all methods approach the ideal $1-\alpha$ coverage line for large $N$, only the Bayesian credible interval and Wilson confidence intervals achieve this for small $N$.
  • Figure 3: Clustered questions setting. Coverage vs. confidence level for various interval-calculation methods on the value of $\theta$. See the appendix on interval widths (app:interval_width) for the corresponding widths. Note that in the small-data regime, neither simple CLT nor clustered CLT intervals achieve correct coverage. Methods that ignore the clustered structure of the data are shown as dotted lines.
  • Figure 4: Independent model comparison setting. Coverage vs. confidence level for various interval-calculation methods when comparing two independent means $\theta_A$ and $\theta_B$, for both the difference (Diff) and odds ratio (OR) metrics. The diagonal gray dashed line represents the expected coverage, $1-\alpha$. The CLT is not applicable to the OR.
  • Figure 5: Paired model comparison setting. Coverage vs. confidence level for various interval-calculation methods on the value of $\theta_A - \theta_B$. Methods that ignore the paired structure of the data (assuming instead IID questions and answers from model A and from model B, as in the section labeled sec:failure_simple_ci) are shown as dotted lines.
  • ...and 34 more figures
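
The coverage plotted in these figures is the fraction of repeated experiments in which an interval contains the true parameter. A rough sketch of how such a number can be estimated for the IID setting is given below. It is not the authors' experimental code: the Uniform(0, 1) draw of the true $\theta$, the trial count, and the function names are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist


def clt_interval(k, n, z):
    """CLT (Wald) interval for a binomial proportion."""
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_interval(k, n, z):
    """Wilson score interval for a binomial proportion."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half


def estimate_coverage(interval_fn, n, level=0.95, trials=4_000, seed=0):
    """Fraction of simulated benchmarks whose interval contains the true theta."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    hits = 0
    for _ in range(trials):
        theta = rng.random()  # assumed: true accuracy drawn from Uniform(0, 1)
        # n IID Bernoulli(theta) answers; k is the number answered correctly.
        k = sum(rng.random() < theta for _ in range(n))
        lo, hi = interval_fn(k, n, z)
        hits += lo <= theta <= hi
    return hits / trials


if __name__ == "__main__":
    for n in (10, 30, 100, 300):
        clt = estimate_coverage(clt_interval, n)
        wil = estimate_coverage(wilson_interval, n)
        print(f"N={n:>3}  CLT coverage={clt:.3f}  Wilson coverage={wil:.3f}")
```

Because both calls use the same seed, the two methods are scored on identical simulated datasets, so any gap in coverage comes from the interval construction rather than from sampling noise in the data.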

Theorems & Definitions (1)

  • Remark 1: Bayesian model comparison
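
For the independent comparison setting of Figure 4, one simple Bayesian treatment can be sketched as follows: place independent Beta posteriors on $\theta_A$ and $\theta_B$, draw samples, and summarize the difference, the odds ratio, and the posterior probability that model A is better. This is an illustrative sketch only, not the content of Remark 1 or the bayes_evals API; the Beta(1, 1) priors, the counts in the usage example, and all names are assumptions.

```python
import random


def _odds(t, eps=1e-12):
    """Odds t / (1 - t), with a tiny clamp to avoid division by zero."""
    t = min(max(t, eps), 1 - eps)
    return t / (1 - t)


def _equal_tailed(samples, level):
    """Equal-tailed interval from Monte Carlo samples."""
    s = sorted(samples)
    lo_idx = int((1 - level) / 2 * len(s))
    hi_idx = int((1 + level) / 2 * len(s)) - 1
    return s[lo_idx], s[hi_idx]


def compare_independent(k_a, n_a, k_b, n_b, level=0.95, draws=20_000, seed=0):
    """Compare two models evaluated on independent question sets.

    Assumes a uniform Beta(1, 1) prior on each accuracy, so the posteriors
    are Beta(1 + k, 1 + n - k) and can be sampled independently.
    """
    rng = random.Random(seed)
    diffs, odds_ratios, a_wins = [], [], 0
    for _ in range(draws):
        theta_a = rng.betavariate(1 + k_a, 1 + n_a - k_a)
        theta_b = rng.betavariate(1 + k_b, 1 + n_b - k_b)
        diffs.append(theta_a - theta_b)
        odds_ratios.append(_odds(theta_a) / _odds(theta_b))
        a_wins += theta_a > theta_b
    return {
        "diff": _equal_tailed(diffs, level),
        "odds_ratio": _equal_tailed(odds_ratios, level),
        "p_a_better": a_wins / draws,
    }


if __name__ == "__main__":
    # Hypothetical counts: model A gets 18/20 correct, model B gets 12/20.
    result = compare_independent(18, 20, 12, 20)
    print("95% interval for theta_A - theta_B:", result["diff"])
    print("95% interval for the odds ratio:   ", result["odds_ratio"])
    print("P(theta_A > theta_B | data):       ", result["p_a_better"])
```

Working with posterior samples means that an interval for any function of $\theta_A$ and $\theta_B$, including the odds ratio, comes from the same draws with no separate derivation.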