Position: AI Evaluation Should Learn from How We Test Humans
Yan Zhuang, Qi Liu, Zachary A. Pardos, Patrick C. Kyllonen, Jiyun Zu, Zhenya Huang, Shijin Wang, Enhong Chen
TL;DR
The paper argues that current AI evaluation relies largely on big static benchmarks that are costly, prone to data contamination, and limited in how much information they provide. It proposes adopting adaptive testing grounded in psychometrics to estimate latent abilities (θ) and item characteristics, enabling uncertainty quantification, reduced evaluation dimensionality, and better interpretability and comparability across benchmarks. The authors detail two phases: annotating item characteristics (difficulty, discrimination, guessing) and interactive dynamic evaluation (adaptive item selection via Fisher information). Together these illustrate how AI assessments can become more efficient and robust. They also discuss extending the framework to non-ability traits (ethics, bias) and to a suite of measurement models beyond traditional IRT, highlighting opportunities and challenges for broad adoption and the potential need for new disciplines such as Machine Psychometrics.
Abstract
As AI systems continue to evolve, their rigorous evaluation becomes crucial for their development and deployment. Researchers have constructed various large-scale benchmarks to determine their capabilities, typically evaluating models against a gold-standard test set and reporting metrics averaged across all items. However, this static evaluation paradigm increasingly shows its limitations, including high evaluation costs, data contamination, and the impact of low-quality or erroneous items on evaluation reliability and efficiency. In this position paper, drawing from human psychometrics, we discuss a paradigm shift from static evaluation methods to adaptive testing. This involves estimating the characteristics or value of each test item in the benchmark and tailoring each model's evaluation instead of relying on a fixed test set. This paradigm provides robust ability estimation, uncovering the latent traits underlying a model's observed scores. This position paper analyzes the current possibilities, prospects, and reasons for adopting psychometrics in AI evaluation. We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
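To make the two phases concrete, the following is a minimal sketch of adaptive testing, not the authors' implementation. It assumes the standard three-parameter logistic (3PL) IRT model, in which each item has a discrimination a, a difficulty b, and a guessing parameter c, and it assumes those parameters have already been annotated. The function names, the grid-search maximum-likelihood estimator, and the `answer_fn` callback (a stand-in for querying the model under evaluation) are all illustrative assumptions.

```python
import math

def p_correct(theta, a, b, c):
    """3PL probability that a test taker with ability theta answers correctly."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta, a, b, c):
    """Fisher information a 3PL item carries about theta."""
    p = p_correct(theta, a, b, c)
    q = 1.0 - p
    return a * a * (q / p) * ((p - c) / (1.0 - c)) ** 2

def select_next_item(theta, items, asked):
    """Pick the unasked item that is most informative at the current estimate."""
    candidates = [i for i in range(len(items)) if i not in asked]
    return max(candidates, key=lambda i: fisher_information(theta, *items[i]))

def estimate_theta(responses, items, grid=None):
    """Grid-search maximum-likelihood estimate of theta (an assumed estimator).

    responses is a list of (item_index, correct) pairs.
    """
    if grid is None:
        grid = [g / 20.0 for g in range(-80, 81)]  # -4.0 to 4.0 in steps of 0.05

    def log_likelihood(theta):
        total = 0.0
        for idx, correct in responses:
            p = p_correct(theta, *items[idx])
            total += math.log(p if correct else 1.0 - p)
        return total

    return max(grid, key=log_likelihood)

def adaptive_test(answer_fn, items, n_steps, theta0=0.0):
    """Alternate between item selection and ability re-estimation."""
    theta, asked, responses = theta0, set(), []
    for _ in range(min(n_steps, len(items))):
        idx = select_next_item(theta, items, asked)
        asked.add(idx)
        responses.append((idx, bool(answer_fn(idx))))
        theta = estimate_theta(responses, items)
    return theta, responses

# Hypothetical item bank: (discrimination a, difficulty b, guessing c).
items = [(1.0, -1.0, 0.2), (1.5, 0.0, 0.2), (2.0, 1.0, 0.25)]
```

In this sketch the estimate is clipped to the grid, so a test taker who answers every item correctly ends at the upper bound of 4.0. A practical system would more likely use a Bayesian estimator with a prior on θ and would report the uncertainty of the estimate alongside it.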
