Position: Don't Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints
Sam Bowyer, Laurence Aitchison, Desi R. Ivanova
TL;DR
The paper argues that CLT-based confidence intervals are unreliable for LLM evals when benchmarks are small (fewer than a few hundred datapoints), particularly when the data are also structured (e.g., clustered rather than IID questions). It demonstrates these failures systematically for IID questions, clustered questions, independent and paired model comparisons, and metrics that are not simple averages, and shows that Bayesian credible intervals and certain frequentist alternatives give valid, well-calibrated uncertainty even at small $N$. It offers practical guidance and a Python library (bayes_evals) implementing the Bayesian methods, and concludes that Bayesian or robust frequentist uncertainty quantification should become standard practice in LLM evaluations. The work emphasizes more reliable and fairer model comparisons and deployment decisions, especially where data are costly or highly specialized.
Abstract
Rigorous statistical evaluations of large language models (LLMs), including valid error bars and significance testing, are essential for meaningful and reliable performance assessment. Currently, when such statistical measures are reported, they typically rely on the Central Limit Theorem (CLT). In this position paper, we argue that while CLT-based methods for uncertainty quantification are appropriate when benchmarks consist of thousands of examples, they fail to provide adequate uncertainty estimates for LLM evaluations that rely on smaller, highly specialized benchmarks. In these small-data settings, we demonstrate that CLT-based methods perform very poorly, usually dramatically underestimating uncertainty (i.e., producing error bars that are too small). We give recommendations for alternative frequentist and Bayesian methods that are both easy to implement and more appropriate in these increasingly common scenarios. We provide a simple Python library for these Bayesian methods at https://github.com/sambowyer/bayes_evals.
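To make the contrast concrete, the sketch below compares three 95% intervals for the accuracy of a model on a small benchmark of binary (correct/incorrect) questions. It is an illustration only and does not use or reproduce the bayes_evals API: the function names are hypothetical, the CLT interval is assumed to be the plug-in normal approximation, the frequentist alternative is assumed to be the Wilson score interval, and the Bayesian interval assumes a uniform Beta(1, 1) prior with quantiles estimated by Monte Carlo sampling from the Beta posterior.

```python
import math
import random
from statistics import NormalDist


def clt_interval(successes, n, level=0.95):
    """Plug-in normal-approximation (CLT) interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


def wilson_interval(successes, n, level=0.95):
    """Wilson score interval, a frequentist alternative for small n."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def beta_credible_interval(successes, n, level=0.95, draws=20000, seed=0):
    """Equal-tailed credible interval under an assumed uniform Beta(1, 1) prior.

    The posterior is Beta(1 + successes, 1 + failures); its quantiles are
    approximated here by sorting Monte Carlo draws from that posterior.
    """
    rng = random.Random(seed)
    samples = sorted(
        rng.betavariate(1 + successes, 1 + n - successes) for _ in range(draws)
    )
    lo_idx = int((1 - level) / 2 * draws)
    hi_idx = int((1 + level) / 2 * draws) - 1
    return samples[lo_idx], samples[hi_idx]


if __name__ == "__main__":
    # Hypothetical benchmark: the model answers all 10 questions correctly.
    k, n = 10, 10
    print("CLT:     ", clt_interval(k, n))
    print("Wilson:  ", wilson_interval(k, n))
    print("Bayesian:", beta_credible_interval(k, n))
```

With 10 correct answers out of 10, the plug-in standard error is zero, so the CLT interval collapses to the single point 1.0 and reports no uncertainty at all. The Wilson and Beta intervals both extend well below 1, which is the more defensible summary of ten observations.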
