Benchmark Health Index: A Systematic Framework for Benchmarking the Benchmarks of LLMs
Longyuan Zhu, Hairan Hua, Linlin Miao, Bing Zhao
TL;DR
The paper introduces the Benchmark Health Index (BHI), a data-driven framework that audits LLM benchmarks along three orthogonal axes: Capability Discrimination ($S_{Disc}$), Anti-Saturation ($S_{AS}$), and Impact ($S_{Imp}$). $S_{Disc}$ is operationalized via the Effective Differentiation Ratio (EDR) and the Robust Coefficient of Variation (RCV), which are normalized and combined with SDM weighting. Model capabilities are calibrated using LOBO with the Fourth-root Log-Balance Model, and $S_{AS}$ is computed from Static Weighted Resistance and Dynamic Saturation Projection. $S_{Imp}$ combines Industry Adoption and Community Heat using CV-based weighting. The three axis scores feed a CRITIC-weighted final BHI score, which is validated on 106 benchmarks drawn from 91 models from 2025. The results reveal structural distortions in the evaluation ecosystem, distinguish frontier, mismatched, and domain-specific benchmarks, and offer actionable guidance for benchmark governance and lifecycle management. The framework is robust to data sparsity and noise, and the findings emphasize the need for dynamic, high-headroom evaluation protocols to guide future LLM development and benchmarking practice.
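To make the CRITIC-weighted aggregation concrete, the following is a minimal sketch of the standard CRITIC scheme, in which each criterion's weight is proportional to its standard deviation times its total conflict (one minus correlation) with the other criteria. It is not taken from the paper: the min-max normalization, the function names `critic_weights` and `composite_score`, and the toy numbers are all assumptions for illustration, and the paper's exact preprocessing may differ.

```python
import numpy as np


def _min_max(scores: np.ndarray) -> np.ndarray:
    """Min-max normalize each column to [0, 1]. Assumes no column is constant."""
    x = np.asarray(scores, dtype=float)
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))


def critic_weights(scores: np.ndarray) -> np.ndarray:
    """Standard CRITIC weights for a (n_items, n_criteria) score matrix.

    Hypothetical helper: here the columns would be the three axis scores
    (S_Disc, S_AS, S_Imp) and the rows would be benchmarks.
    """
    z = _min_max(scores)
    sigma = z.std(axis=0, ddof=1)            # contrast intensity per criterion
    corr = np.corrcoef(z, rowvar=False)      # criterion-by-criterion correlation
    conflict = (1.0 - corr).sum(axis=0)      # disagreement with other criteria
    info = sigma * conflict                  # information content per criterion
    return info / info.sum()                 # weights sum to 1


def composite_score(scores: np.ndarray) -> np.ndarray:
    """Weighted sum of normalized criteria, one value per row."""
    return _min_max(scores) @ critic_weights(scores)


if __name__ == "__main__":
    # Made-up axis scores for four hypothetical benchmarks (not paper data).
    toy = np.array([
        [0.9, 0.2, 0.5],
        [0.4, 0.8, 0.6],
        [0.7, 0.5, 0.1],
        [0.1, 0.9, 0.9],
    ])
    print("weights:", critic_weights(toy))
    print("composite:", composite_score(toy))
```

In this scheme, an axis that varies widely across benchmarks and is weakly or negatively correlated with the other axes receives a larger weight, which matches the intent of weighting complementary axes by the information they contribute.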
Abstract
Large Language Models (LLMs) are advancing rapidly, yet the benchmarks used to measure this progress are becoming increasingly unreliable. Score inflation and selective reporting have eroded the authority of standard benchmarks, leaving the community uncertain about which evaluation results remain trustworthy. We introduce the Benchmark Health Index (BHI), a purely data-driven framework for auditing evaluation sets along three orthogonal and complementary axes: (1) Capability Discrimination, which measures how sharply a benchmark separates model performance beyond noise; (2) Anti-Saturation, which estimates the headroom remaining before ceiling effects erode resolution, and thus the benchmark's expected longevity; and (3) Impact, which quantifies influence across academic and industrial ecosystems via adoption breadth and practice-shaping power. By distilling 106 validated benchmarks from the technical reports of 91 representative models in 2025, we systematically characterize the evaluation landscape. BHI is the first framework to quantify benchmark health at a macro level, providing a principled basis for benchmark selection and enabling dynamic lifecycle management for next-generation evaluation protocols.
