Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations

Evan Miller

TL;DR

This work reframes language-model evaluations as statistical experiments whose questions are drawn from an unseen super-population. It proposes a framework for quantifying the precision of eval scores with confidence intervals, accounting for clustered questions, and planning experiments. It introduces variance-reduction techniques (resampling, next-token analysis) and advocates paired over unpaired comparisons when feasible, with formulas for standard errors, confidence-interval construction, and power analysis. Practical guidance includes reporting standard errors alongside means, presenting pairwise results and correlations, and using cluster-adjusted methods to avoid overstating precision. Together, these contributions aim to make eval results more reliable and interpretable and to encourage ML evaluation to adopt statistical practices from other disciplines.

Abstract

Evaluations are critical for understanding the capabilities of large language models (LLMs). Fundamentally, evaluations are experiments; but the literature on evaluations has largely ignored the literature from other sciences on experiment analysis and planning. This article shows researchers with some training in statistics how to think about and analyze data from language model evaluations. Conceptualizing evaluation questions as having been drawn from an unseen super-population, we present formulas for analyzing evaluation data, measuring differences between two models, and planning an evaluation experiment. We make a number of specific recommendations for running language model evaluations and reporting experiment results in a way that minimizes statistical noise and maximizes informativeness.

Paper Structure

This paper contains 16 sections, 46 equations, and 5 tables.