Position: AI Evaluation Should Learn from How We Test Humans

Yan Zhuang; Qi Liu; Zachary A. Pardos; Patrick C. Kyllonen; Jiyun Zu; Zhenya Huang; Shijin Wang; Enhong Chen

TL;DR

The paper argues that current AI evaluation largely relies on large static benchmarks that are costly, prone to data contamination, and limited in informative content. It proposes adopting adaptive testing grounded in psychometrics to estimate latent abilities (θ) and item characteristics, enabling uncertainty quantification, reduced evaluation dimensionality, and better interpretability and comparability across benchmarks. By detailing two phases—annotating item characteristics (difficulty, discrimination, guessing) and interactive dynamic evaluation (adaptive item selection via Fisher information)—the authors illustrate how AI assessments can become more efficient and robust. They also discuss extending the framework to non-ability traits (ethics, bias) and a suite of measurement models beyond traditional IRT, highlighting opportunities and challenges for broad adoption and the potential need for new disciplines like Machine Psychometrics.

Abstract

As AI systems continue to evolve, their rigorous evaluation becomes crucial for their development and deployment. Researchers have constructed various large-scale benchmarks to determine their capabilities, typically evaluating models against a gold-standard test set and reporting metrics averaged across all items. However, this static evaluation paradigm increasingly shows its limitations, including high evaluation costs, data contamination, and the impact of low-quality or erroneous items on evaluation reliability and efficiency. In this position paper, drawing from human psychometrics, we discuss a paradigm shift from static evaluation methods to adaptive testing. This involves estimating the characteristics or value of each test item in the benchmark, and tailoring each model's evaluation instead of relying on a fixed test set. This paradigm provides robust ability estimation, uncovering the latent traits underlying a model's observed scores. This position paper analyzes the current possibilities, prospects, and reasons for adopting psychometrics in AI evaluation. We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
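The adaptive item selection via Fisher information mentioned in the TL;DR can be illustrated with a minimal sketch. It assumes the standard three-parameter logistic (3PL) IRT model with discrimination `a`, difficulty `b`, and guessing `c`, and a simple grid-search maximum-likelihood estimate of ability. The function names, the toy item bank, and the stand-in `answer_fn` are all hypothetical and are not taken from the paper; the authors' actual procedure may differ.

```python
import math

def p_correct(theta, a, b, c):
    """3PL probability that a test-taker with ability theta answers correctly."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta, a, b, c):
    """Fisher information a 3PL item provides about ability at theta."""
    p = p_correct(theta, a, b, c)
    return a ** 2 * ((p - c) / (1.0 - c)) ** 2 * (1.0 - p) / p

def estimate_theta(responses, grid):
    """Grid-search maximum-likelihood ability estimate (an assumed estimator)."""
    def log_lik(theta):
        total = 0.0
        for (a, b, c), correct in responses:
            # Clamp to keep the logarithm finite at extreme abilities.
            p = min(max(p_correct(theta, a, b, c), 1e-9), 1.0 - 1e-9)
            total += math.log(p) if correct else math.log(1.0 - p)
        return total
    return max(grid, key=log_lik)

def adaptive_test(items, answer_fn, n_steps=10):
    """Repeatedly pick the most informative unused item, then re-estimate ability."""
    grid = [i / 10.0 for i in range(-40, 41)]  # candidate abilities in [-4, 4]
    theta = 0.0                                # neutral starting estimate
    remaining = list(range(len(items)))
    responses = []
    for _ in range(min(n_steps, len(items))):
        best = max(remaining, key=lambda j: fisher_information(theta, *items[j]))
        remaining.remove(best)
        correct = answer_fn(best)              # query the system under test
        responses.append((items[best], correct))
        theta = estimate_theta(responses, grid)
    return theta, responses

# Hypothetical pre-annotated item bank: (discrimination, difficulty, guessing).
item_bank = [(1.5, -2.0 + 0.25 * k, 0.2) for k in range(17)]

# Stand-in for an AI system: answers correctly iff the item difficulty is below 0.7.
theta_hat, log = adaptive_test(item_bank, lambda j: item_bank[j][1] < 0.7, n_steps=8)
print(f"estimated ability: {theta_hat:.1f} after {len(log)} items")
```

The sketch stops after a fixed number of items; a stopping rule based on the uncertainty of the ability estimate would be an equally reasonable choice.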

Paper Structure (30 sections, 5 equations, 13 figures, 1 table)

Introduction
Psychometrics Enables Scientific Evaluation
Ability-Oriented Evaluation
Capturing Uncertainty in Performance.
Mitigating the Curse of Dimensionality.
Interpretability and Comparability.
Not All Items Are Equally Important
Identifying Annotation Errors and Low-Quality Items.
Identifying Data Contamination.
Adaptive Testing Conceptualization for AI
Item Characteristics Annotation
Interactive Dynamic Model Evaluation
Core Mechanisms Driving Adaptive Testing
Opportunities and Challenges
Diversified and Deep Measurement Methods.
...and 15 more sections

Figures (13)

Figure 1: The traditional benchmarking paradigm for AI. The reliability of evaluation results can be compromised by several factors, including item quality (e.g., redundancy, contamination, or errors) and the increasing complexity of AI behaviors.
Figure 2: Toy example comparing traditional evaluation metrics with psychometric metrics: a. Traditional accuracy-based metrics are unstable when using random subsets of items, as they rely solely on observed outcomes and cannot ensure subset performance reflects the full dataset. b. Psychometric methods infer ability from limited responses by considering item characteristics. For example, if an AI system answers a 0.8-difficulty item incorrectly but a 0.6-difficulty item correctly, its ability likely lies between 0.6 and 0.8.
Figure 3: Examples of item characteristics from benchmarks: SSTB (sentiment analysis), SQuAD (reading comprehension QA), and MedQA (medical QA) across three factors: difficulty, discrimination, and guessing. These factors are estimated via parameter analysis of model responses. (a) Difficulty ($\beta$): Higher difficulty means a lower probability of a correct response at a fixed ability level. For example, the first example’s ambiguous tone makes it harder to classify than the straightforward second example. (b) Discrimination ($\alpha$): Highly discriminative items distinguish between similar ability levels. The first example’s plausible distractors (e.g., "the Armenian state") increase discrimination, while the second example has negative discrimination due to annotation errors. (c) Guessing factor ($c$): This represents the likelihood of low-ability test-takers guessing correctly. The first item describes hallmark features of anorexia nervosa, allowing it to be answered correctly even with minimal specific knowledge or common sense. The first two cases are adapted from Lalor et al. (2018) and Rodriguez et al. (2021). More detailed information about item characteristics can be found in the paper's appendix.
Figure 4: Using psychometric methods to detect data contamination in AI evaluation. On one hand, contamination can be identified through anomalous behavior of AI models, such as inconsistencies in their performance on contaminated samples compared to their overall behavior. On the other hand, item characteristics, such as the guessing parameter, may also indicate potential contamination.
Figure 5: Three Reasons for the Effectiveness of Psychometrics in AI System Evaluation: a. the transformation of problem nature, b. the interrelatedness of benchmarks, and c. the universal laws exhibited by AI systems.
...and 8 more figures
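
The three item characteristics named in the Figure 3 caption are the parameters of the standard three-parameter logistic (3PL) IRT model. As a reference sketch, assuming that standard form (the paper's exact parameterization is not shown in this excerpt), the probability that a system with ability $\theta$ answers item $j$ correctly is:

```latex
P(y_j = 1 \mid \theta) = c_j + (1 - c_j)\,\frac{1}{1 + \exp\!\bigl(-\alpha_j(\theta - \beta_j)\bigr)}
```

Under this form, $\beta_j$ shifts the curve along the ability axis, $\alpha_j$ sets its slope (a negative $\alpha_j$ means stronger systems are less likely to answer correctly, as with the mislabeled item in panel b), and $c_j$ is the floor reached by very low-ability test-takers.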
