Position: Science of AI Evaluation Requires Item-level Benchmark Data

Han Jiang, Susu Zhang, Xiaoyuan Yi, Xing Xie, Ziang Xiao

Abstract

AI evaluations have become the primary evidence for deploying generative AI systems across high-stakes domains. However, current evaluation paradigms often exhibit systemic validity failures. These issues, ranging from unjustified design choices to misaligned metrics, remain intractable without a principled framework for gathering validity evidence and conducting granular diagnostic analysis. In this position paper, we argue that item-level AI benchmark data is essential for establishing a rigorous science of AI evaluation. Item-level analysis enables fine-grained diagnostics and principled validation of benchmarks. We substantiate this position by dissecting current validity failures and revisiting evaluation paradigms across computer science and psychometrics. Through illustrative analyses of item properties and latent constructs, we demonstrate the unique insights afforded by item-level data. To catalyze community-wide adoption, we introduce OpenEval, a growing repository of item-level benchmark data designed to support evidence-centered AI evaluation.

Paper Structure

This paper contains 20 sections, 9 figures, and 3 tables.

Figures (9)

  • Figure 1: Benchmark-level accuracy distributions for 66 pre–Nov. 2023 models on MMLU and 72 post–Jun. 2024 models on MMLU-Pro. Results are from HELM-Classic and HELM-Capabilities.
  • Figure 2: Item characteristic distributions for MMLU and MMLU-Pro. Item difficulty is transformed to $\text{Diff}_i\!=\!0.5\!-\!p_i$, with higher values corresponding to harder items (see the sketch following this list).
  • Figure 3: Item characteristic curves (ICCs) for three items in MMLU.
  • Figure 4: Item clusters on BabiQA based on factor loadings.
  • Figure 5: Convergent/discriminant evidence of the four sub-constructs (#1–#4) on MMLU-Pro.
  • ...and 4 more figures
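
To make the item-level quantities above concrete, the following is a minimal Python sketch (not taken from the paper) of the classical item statistics behind Figure 2: per-item proportion correct $p_i$, the transformed difficulty $\text{Diff}_i = 0.5 - p_i$, and a simple item-rest discrimination index. The response matrix, its dimensions, and all variable names are placeholder assumptions; the paper's own analyses may differ.

    import numpy as np

    # Hypothetical item-level data: a binary model-by-item response matrix,
    # where entry (m, i) is 1 if model m answers item i correctly, else 0.
    rng = np.random.default_rng(0)
    responses = rng.integers(0, 2, size=(72, 500))  # placeholder: 72 models x 500 items

    # Classical per-item proportion correct p_i (item "easiness").
    p = responses.mean(axis=0)

    # Difficulty transform from Figure 2: Diff_i = 0.5 - p_i,
    # so higher values correspond to harder items.
    difficulty = 0.5 - p

    # Item-rest (point-biserial) discrimination: correlation of each item's
    # responses with each model's total score on the remaining items.
    rest_scores = responses.sum(axis=1, keepdims=True) - responses
    discrimination = np.array([
        np.corrcoef(responses[:, i], rest_scores[:, i])[0, 1]
        for i in range(responses.shape[1])
    ])

    print(difficulty[:5])
    print(discrimination[:5])

Such per-item statistics are only computable when benchmark results are released at the item level, which is the kind of analysis the figures above illustrate.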