Quantifying Variance in Evaluation Benchmarks

Lovish Madaan, Aaditya K. Singh, Rylan Schaeffer, Andrew Poulton, Sanmi Koyejo, Pontus Stenetorp, Sharan Narang, Dieuwke Hupkes

TL;DR

This work addresses the often-overlooked problem of variance in LLM evaluation benchmarks. It defines seed variance, confidence intervals, and monotonicity as variance metrics and measures them on 13 benchmarks across 280 models. It shows that continuous performance metrics and alternative task formulations (e.g., cloze MMLU) can significantly reduce noise and improve monotonicity during training, particularly for smaller models. By contrast, methods borrowed from human testing (item analysis and item response theory) prove generally ineffective at reducing variance for LLM benchmarks and can even inflate it in some cases. The findings give practitioners practical guidance on accounting for variance when comparing models and suggest LM-specific evaluation strategies for obtaining more reliable progress signals.

Abstract

Evaluation benchmarks are the cornerstone of measuring capabilities of large language models (LLMs), as well as driving progress in said capabilities. Originally designed to make claims about capabilities (or lack thereof) in fully pretrained models, evaluation benchmarks are now also extensively used to decide between various training choices. Despite this widespread usage, we rarely quantify the variance in our evaluation benchmarks, which dictates whether differences in performance are meaningful. Here, we define and measure a range of metrics geared towards measuring variance in evaluation benchmarks, including seed variance across initialisations and monotonicity during training. By studying a large number of models -- both openly available and pretrained from scratch -- we provide empirical estimates for a variety of variance metrics, with considerations and recommendations for practitioners. We also evaluate the utility and tradeoffs of continuous versus discrete performance measures and explore options for better understanding and reducing this variance. We find that simple changes, such as framing choice tasks (like MMLU) as completion tasks, can often reduce variance for smaller scale ($\sim$7B) models, while more involved methods inspired by the human testing literature (such as item analysis and item response theory) struggle to meaningfully reduce variance. Overall, our work provides insights into variance in evaluation benchmarks, suggests LM-specific techniques to reduce variance, and more generally encourages practitioners to carefully factor in variance when comparing models.
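
To make the variance measures named in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes that seed variance is summarised as the sample standard deviation of final scores across seeds, that the confidence interval is a percentile bootstrap over benchmark items, and that monotonicity is the Kendall rank correlation between checkpoint order and score. All function names and numbers are hypothetical.

```python
import random
import statistics


def seed_std(final_scores):
    """Sample standard deviation of final benchmark scores across seeds."""
    return statistics.stdev(final_scores)


def bootstrap_ci(item_correct, n_resamples=1000, alpha=0.05, rng_seed=0):
    """Percentile bootstrap CI for mean accuracy, resampling benchmark items."""
    rng = random.Random(rng_seed)
    n = len(item_correct)
    means = sorted(
        sum(rng.choices(item_correct, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = max(lo_idx, int((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


def kendall_tau(xs, ys):
    """Kendall rank correlation (tau-a); +1 means ys always rise with xs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    scores_by_seed = [0.412, 0.405, 0.421, 0.398, 0.417]
    item_correct = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1] * 20
    steps = [1000, 2000, 3000, 4000, 5000]
    scores_over_training = [0.25, 0.27, 0.26, 0.31, 0.34]

    print("seed std:", seed_std(scores_by_seed))
    print("95% CI:", bootstrap_ci(item_correct))
    print("monotonicity:", kendall_tau(steps, scores_over_training))
```

Read this way, a difference between two models carries little weight if it is smaller than the seed standard deviation or falls inside the bootstrap interval, and a monotonicity value well below 1 indicates that checkpoint-to-checkpoint comparisons on that benchmark are noisy.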

Paper Structure (32 sections, 2 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Development of model performance over time. Boxplots for both discrete and continuous metrics depicting model improvement over time for ARC-C, GSM8k, and HumanEval. The top row shows the discrete metric for each benchmark, and the bottom row shows the continuous metric.
  • Figure 2: Development of model performance over time. In $(a)$, we show the boxplots for the two MMLU variants. The top row is for the discrete metric (accuracy) and bottom row for the continuous metric (probability mass of the correct answer). In $(b)$, we show the comparison of the standard (choice) and cloze variants on a Llama-2 13B model trained from scratch.
  • Figure 3: Item analysis results on GSM8k and ARC-C. Results on additional benchmarks are provided in \ref{appx:item_analysis_extra_results}. The first column shows a scatter plot of item difficulty (x-axis) vs item discrimination (y-axis). The second column shows a scatter plot of item discrimination calculated over models from the train or test set of the difficulty split. The third column is the same as the second, except on the random split. As expected (since train and test splits come from the same distribution), discrimination on train models for this split is positively correlated with discrimination on test models. The fourth, fifth, and sixth columns show the effects of iteratively removing up to 20% of items (based on discrimination). The fourth and fifth columns show the resulting delta in the mean and the standard error, respectively, of model performance on the test set from the difficulty split. Error bars indicate 95% confidence intervals in the delta. Monotonicity (sixth column) is calculated over the 10 runs from \ref{subsec:models}. Orange curves show the effects of randomly removing points, as a baseline.
  • Figure 4: Tiny Benchmarks: means and standard errors of the mean (proportional to 95% CI).
  • Figure 5: Development of model performance over time. Boxplots for both discrete and continuous metrics depicting model improvement over time for COPA, Hellaswag, PIQA, and SIQA. The top row shows the discrete metric for each benchmark, and the bottom row shows the continuous metric.
  • ...and 4 more figures
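
The item-analysis quantities in the Figure 3 caption (item difficulty, item discrimination, and removal of low-discrimination items) can be illustrated with a short Python sketch. It uses common classical-test-theory definitions as assumptions, since the paper's exact formulas do not appear in this excerpt: difficulty is the fraction of models that answer an item correctly, and discrimination is the Pearson correlation between per-item correctness and each model's overall score. The filtering step is a one-shot simplification of the iterative removal described in the caption. All names and data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined here; treated as 0 by assumption
    return cov / math.sqrt(var_x * var_y)


def item_difficulty(responses):
    """responses[m][i] is 1 if model m got item i right, else 0."""
    n_models = len(responses)
    n_items = len(responses[0])
    return [sum(r[i] for r in responses) / n_models for i in range(n_items)]


def item_discrimination(responses):
    """Correlation of each item's correctness with the models' overall scores."""
    totals = [sum(r) / len(r) for r in responses]
    n_items = len(responses[0])
    return [pearson([r[i] for r in responses], totals) for i in range(n_items)]


def keep_after_filtering(discrimination, fraction=0.2):
    """Indices of items kept after dropping the lowest-discrimination fraction."""
    n_drop = int(len(discrimination) * fraction)
    ranked = sorted(range(len(discrimination)), key=lambda i: discrimination[i])
    dropped = set(ranked[:n_drop])
    return [i for i in range(len(discrimination)) if i not in dropped]


if __name__ == "__main__":
    # Four hypothetical models evaluated on three hypothetical items.
    responses = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]]
    print("difficulty:", item_difficulty(responses))
    print("discrimination:", item_discrimination(responses))
```

Under these definitions, an item that every model gets right (or wrong) has zero discrimination and is a natural candidate for removal. The caption's random-removal baseline is what indicates whether such filtering does better than chance.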