The Flaw of Averages: Quantifying Uniformity of Performance on Benchmarks
Arda Uzunoglu, Tianjian Li, Daniel Khashabi
TL;DR
Aggregate benchmark scores can mislead conclusions about model capabilities when subdomains are unevenly represented or unevenly mastered. This work introduces Harmony, an entropy-based metric that captures how uniformly a model’s performance is distributed across the subdomains of a benchmark, and pairs it with a model-aware partitioning method based on predictive similarity and spectral clustering. The authors map 19 MCQA benchmarks, evaluated across five model families, onto a plane defined by the mean and variance of Harmony across models. Benchmarks with high mean and low variance of Harmony give more reliable evaluations, while less harmonious benchmarks can distort results because overall accuracy is driven disproportionately by specific subdomains. The authors validate the partitioning approach on RedundantQA, examine how Harmony responds to data pruning, scaling, and token budgets, and argue for reporting Harmony alongside accuracy. Overall, Harmony reframes benchmark evaluation from simple averages to a distributionally reliable measure of performance, with practical implications for benchmark design, reporting practices, and the interpretation of model improvements.
Abstract
Benchmarks shape scientific conclusions about model capabilities and steer model development. This creates a feedback loop: stronger benchmarks drive better models, and better models demand more discriminative benchmarks. Ensuring benchmark reliability is therefore essential for trustworthy evaluation and meaningful progress. In this work, we study benchmark reliability from a distributional perspective and introduce benchmark harmony, which measures how uniformly a model's performance is distributed across the subdomains of a benchmark. We posit that high harmony is a desirable benchmark property, indicating that the aggregate metric reflects uniform competence across subdomains. Across 19 multiple-choice benchmarks and five model families, we map each benchmark onto a mean-variance plane of harmony computed across models, where high mean and low variance signal more reliable evaluation. Our analysis shows that less harmonious benchmarks can give misleading results, since overall accuracy may be disproportionately influenced by specific subdomains. For instance, ARC-Easy is overwhelmed by questions on Biological Concepts, overshadowing other critical subdomains such as Geography, Physics, Chemistry, and Environmental Science. By recommending that harmony should be reported alongside accuracy, we reframe evaluation from simple performance averages to a more robust, distributionally reliable measurement of performance.
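To make the idea of uniformity across subdomains concrete, the sketch below computes a normalized Shannon entropy over per-subdomain accuracies. This is an illustrative assumption, not the Harmony definition from the paper, which is not given in this summary. The function name `uniformity_score`, the subdomain labels, and all accuracy values are hypothetical.

```python
import math


def uniformity_score(subdomain_acc):
    """Normalized Shannon entropy of per-subdomain accuracies, in [0, 1].

    Illustrative stand-in for an entropy-based uniformity measure; this is
    NOT the paper's Harmony formula. A score of 1.0 means every subdomain
    contributes an equal share of the total accuracy mass.
    """
    values = list(subdomain_acc.values())
    if len(values) < 2:
        raise ValueError("need at least two subdomains")
    if any(v < 0 for v in values):
        raise ValueError("accuracies must be non-negative")
    total = sum(values)
    if total == 0:
        # Convention chosen for this sketch: uniformly zero counts as uniform.
        return 1.0
    shares = [v / total for v in values]
    entropy = -sum(p * math.log(p) for p in shares if p > 0)
    return entropy / math.log(len(values))


# Made-up numbers: both models have the same unweighted mean accuracy (0.8),
# but the second model is much weaker on one subdomain.
balanced = {"biology": 0.8, "physics": 0.8, "chemistry": 0.8, "geography": 0.8}
skewed = {"biology": 1.0, "physics": 1.0, "chemistry": 1.0, "geography": 0.2}

for name, acc in [("balanced", balanced), ("skewed", skewed)]:
    mean_acc = sum(acc.values()) / len(acc)
    print(f"{name}: mean accuracy={mean_acc:.2f}, uniformity={uniformity_score(acc):.3f}")
```

In this toy setup the two models are indistinguishable by mean accuracy, while the entropy-based score is lower for the skewed model. That is the kind of discrepancy that reporting a uniformity measure alongside accuracy is meant to surface.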
