The Flaw of Averages: Quantifying Uniformity of Performance on Benchmarks

Arda Uzunoglu, Tianjian Li, Daniel Khashabi

TL;DR

This work addresses the problem that aggregate benchmark scores can mislead conclusions about model capabilities when subdomains are unevenly represented or unevenly mastered. It introduces Harmony, an entropy-based metric that captures the uniformity of a model’s performance across subdomains within a benchmark, and couples it with a model-aware partitioning method based on predictive similarity and spectral clustering. By mapping 19 MCQA benchmarks across five model families onto a mean-variance Harmony plane, the authors show that high Harmony (high mean, low variance) yields more reliable evaluations, while low Harmony can cause distortions where a model appears strong due to a single subdomain. They validate the partitioning approach on RedundantQA, demonstrate how Harmony responds to data pruning, scaling, and token budgets, and argue for reporting Harmony alongside accuracy to enable robust, multi-dimensional evaluation and fairer progress tracking. Overall, Harmony reframes benchmark evaluation from simple averages to distributionally reliable measures of competence, with practical implications for benchmark design, reporting practices, and interpretation of model improvements.

Abstract

Benchmarks shape scientific conclusions about model capabilities and steer model development. This creates a feedback loop: stronger benchmarks drive better models, and better models demand more discriminative benchmarks. Ensuring benchmark reliability is therefore essential for trustworthy evaluation and meaningful progress. In this work, we study benchmark reliability from a distributional perspective and introduce benchmark harmony, which measures how uniformly a model's performance is distributed across the subdomains of a benchmark. We posit that high harmony is a desirable benchmark property, indicating that the aggregate metric reflects uniform competence across subdomains. Across 19 multiple-choice benchmarks and five model families, we map each benchmark onto a mean-variance plane of harmony computed across models, where high mean and low variance signal more reliable evaluation. Our analysis shows that less harmonious benchmarks can give misleading results, since overall accuracy may be disproportionately influenced by specific subdomains. For instance, ARC-Easy is overwhelmed by questions on Biological Concepts, overshadowing other critical subdomains such as Geography, Physics, Chemistry, and Environmental Science. By recommending that harmony should be reported alongside accuracy, we reframe evaluation from simple performance averages to a more robust, distributionally reliable measurement of performance.
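To make "uniformity of performance across subdomains" concrete, here is a minimal sketch of an entropy-based uniformity score. It is an illustration under stated assumptions, not the paper's exact definition: per-subdomain accuracies are normalized into a distribution, and Shannon entropy is divided by its maximum so the score lies in [0, 1]. The function name `harmony` and the example accuracies are hypothetical.

```python
import math


def harmony(cluster_accuracies):
    """Normalized Shannon entropy of per-subdomain accuracies.

    Assumed form for illustration only; the paper's precise
    formula may differ. Returns a value in [0, 1], where 1 means
    performance is spread evenly across subdomains.
    """
    k = len(cluster_accuracies)
    if k < 2:
        raise ValueError("need at least two subdomains")
    total = sum(cluster_accuracies)
    if total == 0:
        # Assumption: all-zero accuracy is treated as perfectly uniform.
        return 1.0
    # Turn accuracies into a probability distribution over subdomains.
    probs = [a / total for a in cluster_accuracies]
    # Shannon entropy, skipping zero-mass subdomains (0 * log 0 := 0).
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Divide by log(k), the entropy of the uniform distribution.
    return entropy / math.log(k)


# Hypothetical per-subdomain accuracies for two models.
uniform = harmony([0.8, 0.8, 0.8, 0.8])
skewed = harmony([0.95, 0.2, 0.2, 0.2])
print(f"uniform model: {uniform:.3f}")
print(f"skewed model:  {skewed:.3f}")
```

Under this sketch, the second model scores lower because most of its correct answers come from a single subdomain, which is the situation the abstract describes for ARC-Easy.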

Paper Structure

This paper contains 65 sections, 14 equations, 16 figures, 24 tables.

Figures (16)

  • Figure 1: Pipeline for evaluating Harmony on a given benchmark. Step 1: We partition the benchmark into semantic clusters (subdomains or skills). Step 2: We gather each model's performance on every cluster. Step 3: We calculate Harmony, the uniformity of the distribution of performance across subdomains. We posit that high Harmony implies that aggregate metrics capture broad competence, whereas low Harmony obscures strengths and weaknesses.
  • Figure 2: Validation of our approach on (a) RedundantQA and (b) MMLU high school subtasks. Estimated Harmony strongly correlates with the ground truth and clearly separates low-Harmony from high-Harmony variants. Each dot represents one variant averaged across five random seeds.
  • Figure 3: Mean-variance plane for Harmony across (a) MCQA benchmarks and (b) MMLU subtasks. Each point represents a benchmark or subtask plotted by the Harmony mean ($\mu_H(\mathcal{B})$) and variance ($\sigma_H^2(\mathcal{B})$) over 36 models. Upper-left (high mean, low variance) indicates higher benchmark reliability; rightward (higher variance) and downward (lower mean) shifts signal diminished reliability. The star at top-left represents an optimal benchmark. Harmony mean and variance are defined in Eq. \ref{eq:mu_and_sigma}.
  • Figure 4: Balancing benchmarks via pruning. We remove overly similar items with a pruning rate inversely proportional to Harmony. Top row shows more harmonious benchmarks, where accuracy remains stable as Harmony increases. Bottom row shows less harmonious benchmarks, where Harmony rises and accuracy shifts significantly. Model-averaged Harmony values for the original and pruned benchmarks are reported in parentheses in the legends.
  • Figure 5: Model size vs. Harmony. Scaling trends are family-specific: Qwen and Llama show negative correlations, while Gemma and OLMo show positive correlations (larger models perform more uniformly). Thus, parameter count alone is not predictive of performance uniformity. The y-axis shows each model's average Harmony over all benchmarks (§\ref{subsec:exp_setup}).
  • ...and 11 more figures
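
The mean-variance plane in Figure 3 places each benchmark by the mean and variance of its Harmony scores across models. A minimal sketch of that summary step follows. The model names and Harmony values are hypothetical, and the use of population variance (rather than sample variance) is an assumption, since the defining equation is not reproduced here.

```python
from statistics import fmean, pvariance


def harmony_mean_variance(harmony_by_model):
    """Summarize one benchmark by the mean and variance of Harmony.

    `harmony_by_model` maps a model name to that model's Harmony on
    the benchmark. Population variance is an assumption made for
    this sketch.
    """
    values = list(harmony_by_model.values())
    if not values:
        raise ValueError("need at least one model")
    return fmean(values), pvariance(values)


# Hypothetical Harmony scores for two benchmarks over three models.
benchmarks = {
    "benchmark_a": {"model_1": 0.95, "model_2": 0.93, "model_3": 0.94},
    "benchmark_b": {"model_1": 0.80, "model_2": 0.55, "model_3": 0.70},
}

summary = {name: harmony_mean_variance(scores)
           for name, scores in benchmarks.items()}
for name, (mu, var) in summary.items():
    print(f"{name}: mean={mu:.3f}, variance={var:.5f}")
```

In this toy data, `benchmark_a` would sit toward the upper-left of the plane (high mean, low variance), while `benchmark_b` would sit lower and further right.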