Benchmark Health Index: A Systematic Framework for Benchmarking the Benchmarks of LLMs

Longyuan Zhu, Hairan Hua, Linlin Miao, Bing Zhao

TL;DR

The paper introduces the Benchmark Health Index (BHI), a data-driven framework for auditing LLM benchmarks along three orthogonal axes: Capability Discrimination ($S_{Disc}$), Anti-Saturation ($S_{AS}$), and Impact ($S_{Imp}$). It operationalizes $S_{Disc}$ via the Effective Differentiation Ratio (EDR) and the Robust Coefficient of Variation (RCV) with normalization and SDM weighting, calibrates model capabilities using LOBO with the Fourth-root Log-Balance Model, and computes $S_{AS}$ through Static Weighted Resistance and Dynamic Saturation Projection. $S_{Imp}$ combines Industry Adoption and Community Heat using CV-based weighting. All three axes feed a CRITIC-weighted final BHI score, validated on 106 benchmarks drawn from the technical reports of 91 models released in 2025. The results reveal structural distortions in the evaluation ecosystem, distinguish frontier, mismatched, and domain-specific benchmarks, and offer actionable guidance for benchmark governance and lifecycle management. The framework is reported to be robust to data sparsity and noise, and the paper emphasizes the need for dynamic, high-headroom evaluation protocols to guide future LLM development and benchmarking practices.

Abstract

Large Language Models (LLMs) are advancing rapidly, yet the benchmarks used to measure this progress are becoming increasingly unreliable. Score inflation and selective reporting have eroded the authority of standard benchmarks, leaving the community uncertain about which evaluation results remain trustworthy. We introduce the Benchmark Health Index (BHI), a purely data-driven framework for auditing evaluation sets along three orthogonal and complementary axes: (1) Capability Discrimination, measuring how sharply a benchmark separates model performance beyond noise; (2) Anti-Saturation, estimating remaining headroom before ceiling effects erode resolution and thus the benchmark's expected longevity; and (3) Impact, quantifying influence across academic and industrial ecosystems via adoption breadth and practice-shaping power. By distilling 106 validated benchmarks from the technical reports of 91 representative models in 2025, we systematically characterize the evaluation landscape. BHI is the first framework to quantify benchmark health at a macro level, providing a principled basis for benchmark selection and enabling dynamic lifecycle management for next-generation evaluation protocols.
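
The TL;DR states that the three axis scores are combined with CRITIC weights. As a rough illustration, the sketch below applies the standard CRITIC method (weight proportional to a criterion's standard deviation times its total conflict, $\sum_k (1 - r_{jk})$, with the other criteria) to a small made-up score matrix. The function name `critic_weights`, the toy numbers, and the final weighted-sum aggregation are assumptions for illustration; the paper's exact normalization and aggregation may differ.

```python
import numpy as np

def critic_weights(scores: np.ndarray) -> np.ndarray:
    """Standard CRITIC weights for an (n_items, n_criteria) matrix.

    Assumes every criterion is "higher is better" and non-constant.
    """
    scores = np.asarray(scores, dtype=float)
    lo = scores.min(axis=0)
    span = scores.max(axis=0) - lo
    if np.any(span == 0):
        raise ValueError("each criterion must vary across items")
    norm = (scores - lo) / span                # min-max normalize each column
    sigma = norm.std(axis=0, ddof=1)           # contrast intensity per criterion
    corr = np.corrcoef(norm, rowvar=False)     # pairwise Pearson correlations
    conflict = (1.0 - corr).sum(axis=0)        # disagreement with other criteria
    info = sigma * conflict                    # information content C_j
    return info / info.sum()

# Hypothetical axis scores (columns: S_Disc, S_AS, S_Imp) for four benchmarks.
toy_scores = np.array([
    [0.9, 0.2, 0.5],
    [0.4, 0.8, 0.6],
    [0.7, 0.5, 0.1],
    [0.1, 0.9, 0.9],
])

weights = critic_weights(toy_scores)
# Assumed aggregation: a plain weighted sum of the raw axis scores.
composite = toy_scores @ weights
print(weights, composite)
```

CRITIC rewards axes that both spread the benchmarks apart and carry information not already captured by the other axes, so a highly correlated pair of axes is not counted twice.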

Paper Structure (74 sections, 16 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Top 10 Benchmarks Ranked by BHI
  • Figure 2: Overview of BHI Framework. Top Row: Three critical challenges characterize current benchmarks: diminished discriminative power, performance-driven saturation, and a decoupling of benchmark influence from real-world performance. Middle Row: We introduce three data-driven metrics: Capability Discrimination ($S_{Disc}$) quantifies benchmark sensitivity, Anti-Saturation ($S_{AS}$) assesses challenge headroom, and Impact ($S_{Imp}$) gauges authentic model recognition. Bottom Row: The final BHI score is synthesized through an objective weighting mechanism.
  • Figure 3: Chronological distribution of the 91 mainstream large language models released throughout 2025.
  • Figure 4: The BHI Data Architecture. (a) The systematic process of data constriction. (b) The taxonomic distribution of the validated benchmark set across 14 functional domains. (c) The distribution of evaluated models across diverse global AI vendors.
  • Figure 5: Performance evolution of selected benchmarks. Regression slopes ($k$) indicate the saturation speed of each evaluator.
  • ...and 3 more figures
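
The Figure 5 caption refers to regression slopes ($k$) as a measure of saturation speed. A minimal sketch of how such a slope could be estimated is given below, assuming an ordinary least-squares line fit of score against release month. The data points are made up, and the naive "months until a ceiling of 100" extrapolation is only an illustration, not the paper's Dynamic Saturation Projection.

```python
import numpy as np

# Made-up (release month, reported score) pairs for one hypothetical benchmark.
months = np.array([1, 3, 5, 7, 9, 11], dtype=float)
scores = np.array([42.0, 47.5, 55.0, 58.5, 66.0, 71.0])

# Least-squares line: score ~ k * month + intercept.
k, intercept = np.polyfit(months, scores, deg=1)

# Naive linear extrapolation to an assumed score ceiling of 100.
ceiling = 100.0
month_at_ceiling = (ceiling - intercept) / k

print(f"slope k = {k:.2f} points/month")
print(f"linear fit reaches {ceiling:.0f} around month {month_at_ceiling:.1f}")
```

A larger positive $k$ means reported scores are climbing faster, so the benchmark has less time left before ceiling effects compress the differences between models.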