Beyond the Singular: Revealing the Value of Multiple Generations in Benchmark Evaluation

Wenbo Zhang, Hengrui Cai, Wenyu Chen

TL;DR

The paper tackles the instability of single-generation LLM benchmark scores by modeling benchmarking as a hierarchical probabilistic process that separates prompt difficulty from model randomness. By leveraging multiple generations, it reduces variance, enables estimation of a fine-grained prompt difficulty metric P(correct), and supports data-driven quality control through a data map that visualizes difficulty and semantic consistency. The approach is validated across multiple benchmarks and open-source LLMs, showing improved reliability over traditional single-sample benchmarks and revealing insights into prompt quality and model behavior. This work offers a principled framework for more robust, informative benchmarking in generative AI systems with practical implications for dataset construction and model evaluation.

Abstract

Large language models (LLMs) have demonstrated significant utility in real-world applications, exhibiting impressive capabilities in natural language processing and understanding. Benchmark evaluations are crucial for assessing the capabilities of LLMs as they can provide a comprehensive assessment of their strengths and weaknesses. However, current evaluation methods often overlook the inherent randomness of LLMs by employing deterministic generation strategies or relying on a single random sample, resulting in unaccounted sampling variance and unreliable benchmark score estimates. In this paper, we propose a hierarchical statistical model that provides a more comprehensive representation of the benchmarking process by incorporating both benchmark characteristics and LLM randomness. We show that leveraging multiple generations improves the accuracy of estimating the benchmark score and reduces variance. Multiple generations also allow us to define $\mathbb P\left(\text{correct}\right)$, a prompt-level difficulty score based on correct ratios, providing fine-grained insights into individual prompts. Additionally, we create a data map that visualizes difficulty and semantics of prompts, enabling error detection and quality control in benchmark construction.

Paper Structure

This paper contains 16 sections, 1 theorem, 8 equations, 5 figures, 2 tables.

Key Result

Lemma 2.1

Consider the hierarchical model in (stat model) and the moment estimator $\hat{\mu} =\frac{\sum_{i=1}^n\sum_{j=1}^k y_{i,j}}{nk}$. Then $\hat{\mu}$ is an unbiased estimator of $\mu$ and its variance equals:
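
As a hedged illustration of why pooling $k$ generations per prompt helps, the sketch below works under assumptions that are not stated in this summary: each prompt $i$ has a latent success probability $p_i$ drawn i.i.d. with mean $\mu$ and variance $\sigma^2$, and each generation is $y_{i,j} \mid p_i \sim \text{Bernoulli}(p_i)$. Under those assumptions the law of total variance gives $\text{Var}(\hat{\mu}) = \frac{\mu - \mu^2 - \sigma^2}{nk} + \frac{\sigma^2}{n}$, which should be read as a consequence of the assumed model rather than a quotation of the lemma. The Beta distribution for $p_i$, the symbol $\sigma^2$, and all function names are illustrative choices.

```python
import random
from statistics import fmean, pvariance


def theoretical_variance(mu, sigma2, n, k):
    """Variance of the pooled mean under the ASSUMED two-level model."""
    within = (mu - mu**2 - sigma2) / (n * k)  # Bernoulli noise, shrinks as k grows
    between = sigma2 / n                      # spread of prompt difficulty, unaffected by k
    return within + between


def simulate_mu_hat(n, k, a, b, rng):
    """One simulated benchmark run: n prompts, k generations per prompt."""
    total = 0
    for _ in range(n):
        p_i = rng.betavariate(a, b)  # assumed difficulty distribution Beta(a, b)
        total += sum(rng.random() < p_i for _ in range(k))
    return total / (n * k)


def empirical_moments(n, k, a, b, reps, seed=0):
    """Mean and variance of mu_hat over `reps` independent simulated runs."""
    rng = random.Random(seed)
    estimates = [simulate_mu_hat(n, k, a, b, rng) for _ in range(reps)]
    return fmean(estimates), pvariance(estimates)


if __name__ == "__main__":
    a, b, n = 2.0, 2.0, 50
    mu = a / (a + b)                               # mean of Beta(a, b)
    sigma2 = a * b / ((a + b) ** 2 * (a + b + 1))  # variance of Beta(a, b)
    for k in (1, 10):
        mean_hat, var_hat = empirical_moments(n, k, a, b, reps=2000)
        print(k, round(mean_hat, 3), var_hat, theoretical_variance(mu, sigma2, n, k))
```

With $k = 1$ the expression reduces to the familiar $\mu(1-\mu)/n$. Increasing $k$ shrinks only the within-prompt term, so under these assumptions the variance approaches, but does not fall below, $\sigma^2/n$.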

Figures (5)

  • Figure 1: Distribution of $\mathbb P\left(\text{correct}\right)$ of $4$ benchmarks.
  • Figure 2: Benchmark score of IFEval over different $k$.
  • Figure 3: Data map for GSM8K with Llama 70b.
  • Figure 4: Distribution of $\mathbb P\left(\text{correct}\right)$ for GSM8K and MUSR when varying temperature $T$.
  • Figure 5: Examples of detected mislabeled and ambiguous prompts in GSM8K.

Theorems & Definitions (1)

  • Lemma 2.1