Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory

Hongli Zhou, Hui Huang, Ziqing Zhao, Lvyuan Han, Huicheng Wang, Kehai Chen, Muyun Yang, Wei Bao, Jian Dong, Bing Xu, Conghui Zhu, Hailong Cao, Tiejun Zhao

TL;DR

This work critiques large language model benchmarks for inconsistent leaderboards and weak separability, and introduces PSN-IRT, a two-branch neural IRT framework that estimates a model-ability score $\theta$ and item parameters $(a,b,c,d)$ within a $4$-parameter logistic model to diagnose benchmark quality. Applying PSN-IRT to 11 diverse benchmarks across 12 models, the authors show improved parameter estimation and rank reliability over traditional IRT, and better alignment with human preferences than many existing analyses. They reveal pervasive issues in current benchmarks, including item saturation, insufficient difficulty ceilings, data contamination via high guessing rates, and flawed item design, while demonstrating that strategically curated, high-information item subsets can yield stronger discriminability with far fewer items. Overall, PSN-IRT provides a practical, interpretable pathway to more reliable, efficient benchmarking and to constructing benchmarks that better reflect true model capabilities.
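
To make the roles of $\theta$ and $(a,b,c,d)$ concrete, the sketch below evaluates the standard 4-parameter logistic response function and the corresponding Fisher item information, then ranks a few items by information at a given ability. It assumes the textbook 4PL form; the function names, the toy item values, and the use of Fisher information as the selection criterion are illustrative assumptions, not the authors' published procedure.

```python
import math


def four_pl_probability(theta, a, b, c, d):
    """Standard 4PL: probability that a model with ability theta answers correctly.

    a = discriminability, b = difficulty, c = guessing floor, d = upper ceiling.
    """
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b, c, d):
    """Fisher information of a 4PL item at ability theta: P'(theta)^2 / (P * (1 - P))."""
    p = four_pl_probability(theta, a, b, c, d)
    slope = a * (p - c) * (d - p) / (d - c)  # derivative of P with respect to theta
    return slope ** 2 / (p * (1.0 - p))


def top_k_informative(items, theta, k):
    """Return the names of the k items with the highest information at theta."""
    ranked = sorted(
        items,
        key=lambda name: item_information(theta, *items[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical items, given as (a, b, c, d); the values are made up for illustration.
items = {
    "saturated": (1.0, -4.0, 0.0, 1.0),  # far too easy for a model with theta = 0
    "sharp": (2.5, 0.0, 0.0, 1.0),       # steep curve centred on theta = 0
    "guessable": (2.5, 0.0, 0.5, 1.0),   # same slope, but a high guessing floor
}

if __name__ == "__main__":
    for name, params in items.items():
        print(name, round(item_information(0.0, *params), 4))
    print(top_k_informative(items, theta=0.0, k=2))
```

In this toy setting the saturated item carries almost no information at $\theta = 0$, and a high guessing floor $c$ lowers the information of an otherwise sharp item, which mirrors the saturation and guessing-rate issues the summary describes.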

Abstract

The evaluation of large language models (LLMs) via benchmarks is widespread, yet inconsistencies between different leaderboards and poor separability among top models raise concerns about their ability to accurately reflect authentic model capabilities. This paper provides a critical analysis of benchmark effectiveness, examining prominent mainstream LLM benchmarks using results from diverse models. We first propose the Pseudo-Siamese Network for Item Response Theory (PSN-IRT), an enhanced Item Response Theory framework that incorporates a rich set of item parameters within an IRT-grounded architecture. PSN-IRT can be used to obtain accurate and reliable estimates of item characteristics and model abilities. Based on PSN-IRT, we conduct an extensive analysis of 11 LLM benchmarks comprising 41,871 items, revealing significant and varied shortcomings in their measurement quality. Furthermore, we demonstrate that PSN-IRT can be used to construct smaller benchmarks that maintain stronger alignment with human preference.

Paper Structure

This paper contains 36 sections, 9 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Illustration of weak separability and ranking inconsistencies in LLM benchmarks.
  • Figure 2: Illustration of the proposed PSN-IRT. Separate neural networks estimate model ability ($\theta$) and item parameters ($a,b,c,d$), which are then combined via the IRT formula to predict the probability of a correct response. Afterwards, each network can be used separately to estimate the properties of models or items, respectively.
  • Figure 3: Distribution of item-level properties across 11 LLM benchmarks.
  • Figure 4: Scatter plots showing the relationship between item difficulty and discriminability across 11 benchmarks. Each plot highlights how difficulty affects discriminability within a dataset.
  • Figure 5: Examples of benchmark items exhibiting high and low estimated values for key psychometric parameters, as determined by PSN-IRT. Each pair shows an item with a high parameter value alongside one with a low value for the specified characteristic.
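
The two-branch layout described in the Figure 2 caption can be sketched as a forward pass. This is a minimal illustration under stated assumptions: the class name `TwoBranchIRT` is hypothetical, each branch is reduced to a lookup table of randomly initialised values standing in for a neural network, the parameter constraints (softplus for $a$, sigmoid ranges for $c$ and $d$) are choices made here and not taken from the paper, and training is omitted.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    return math.log1p(math.exp(x))


class TwoBranchIRT:
    """Hypothetical two-branch IRT predictor (forward pass only, no training)."""

    def __init__(self, n_models, n_items, seed=0):
        rng = random.Random(seed)
        # Model branch: one raw value per model, standing in for a network output.
        self.model_raw = [rng.gauss(0.0, 1.0) for _ in range(n_models)]
        # Item branch: four raw values per item, standing in for a network output.
        self.item_raw = [
            [rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(n_items)
        ]

    def ability(self, model_id):
        """Model branch alone: returns theta for one model."""
        return self.model_raw[model_id]

    def item_params(self, item_id):
        """Item branch alone: returns (a, b, c, d) with assumed constraints."""
        raw_a, raw_b, raw_c, raw_d = self.item_raw[item_id]
        a = softplus(raw_a)             # assumed: discriminability kept positive
        b = raw_b                       # difficulty left unconstrained
        c = 0.5 * sigmoid(raw_c)        # assumed: guessing floor in (0, 0.5)
        d = 0.5 + 0.5 * sigmoid(raw_d)  # assumed: ceiling in (0.5, 1)
        return a, b, c, d

    def predict(self, model_id, item_id):
        """Combine both branches through the 4PL formula."""
        theta = self.ability(model_id)
        a, b, c, d = self.item_params(item_id)
        return c + (d - c) * sigmoid(a * (theta - b))


if __name__ == "__main__":
    net = TwoBranchIRT(n_models=3, n_items=5)
    print(net.ability(0))      # model-side estimate
    print(net.item_params(2))  # item-side estimate
    print(net.predict(0, 2))   # predicted probability of a correct response
```

Keeping `ability` and `item_params` as separate methods reflects the caption's point that either branch can be queried on its own after the joint prediction step.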