Evaluating AI Evaluation: Perils and Prospects
John Burden
TL;DR
The paper argues that current AI evaluation is perilous due to over-reliance on narrow benchmarks with weak construct validity, which undermines predictive power for deployment in diverse, real-world contexts. It proposes a cognitively-inspired shift toward capability-oriented evaluation, formalising task spaces with $M$, instances $\mu$, and performance $\psi$, and emphasizing predictive validity via $\Psi(\pi, M) = \mathbb{E}_{\mu \sim p_M}[\psi(\pi, \mu)]$. The work surveys psychometrics, SEM, IRT, and developmental psychology as sources for measuring latent capabilities, criticises Evals and benchmark-centric practices, and argues for three forward-looking trajectories: cultural change, mechanistic interpretability, and capability-oriented evaluation. Its contributions include a formal framework for task-instance evaluation, a critique of current benchmarks (e.g., HELM Classic) and Evals, and a roadmap for integrating cognitive-science methods into AI safety evaluation. The significance lies in steering AI evaluation toward a rigorous, theory-grounded science that can better predict, constrain, and safely govern increasingly capable AI systems across broad, real-world tasks.
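The expected-performance quantity $\Psi(\pi, M)$ can be made concrete with a small sketch. The code below is not taken from the paper: the function names, the toy task space (integer difficulty levels), the threshold agent, and both distributions are hypothetical choices made purely for illustration. It computes the expectation exactly for a finite task space and also estimates it by sampling instances from $p_M$.

```python
import random
from typing import Callable, Sequence, TypeVar

Policy = TypeVar("Policy")
Instance = TypeVar("Instance")


def exact_expected_performance(
    policy: Policy,
    instances: Sequence[Instance],
    probs: Sequence[float],
    score: Callable[[Policy, Instance], float],
) -> float:
    """Psi(pi, M) for a finite task space: sum over mu of p_M(mu) * psi(pi, mu)."""
    return sum(p * score(policy, mu) for mu, p in zip(instances, probs))


def sampled_expected_performance(
    policy: Policy,
    instances: Sequence[Instance],
    probs: Sequence[float],
    score: Callable[[Policy, Instance], float],
    n_samples: int = 10_000,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of Psi(pi, M): average psi over instances drawn from p_M."""
    rng = random.Random(seed)
    drawn = rng.choices(instances, weights=probs, k=n_samples)
    return sum(score(policy, mu) for mu in drawn) / n_samples


# Hypothetical toy setup (an assumption, not from the paper):
# instances are difficulty levels, and the "agent" is just a skill threshold.
DIFFICULTIES = [1, 2, 3, 4]
EASY_HEAVY = [0.4, 0.3, 0.2, 0.1]  # p_M skewed toward easy instances
HARD_HEAVY = [0.1, 0.2, 0.3, 0.4]  # p_M skewed toward hard instances


def threshold_score(skill: int, difficulty: int) -> float:
    """psi(pi, mu): 1.0 if the agent's skill covers the instance difficulty, else 0.0."""
    return 1.0 if difficulty <= skill else 0.0


if __name__ == "__main__":
    agent_skill = 2
    print(exact_expected_performance(agent_skill, DIFFICULTIES, EASY_HEAVY, threshold_score))
    print(exact_expected_performance(agent_skill, DIFFICULTIES, HARD_HEAVY, threshold_score))
    print(sampled_expected_performance(agent_skill, DIFFICULTIES, EASY_HEAVY, threshold_score))
```

In this toy setting the same agent scores about 0.7 under the easy-skewed distribution and about 0.3 under the hard-skewed one, which shows how strongly an aggregate score depends on the choice of $p_M$ rather than on the agent alone.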
Abstract
As AI systems appear to exhibit ever-increasing capability and generality, assessing their true potential and safety becomes paramount. This paper contends that the prevalent evaluation methods for these systems are fundamentally inadequate, heightening the risks and potential hazards associated with AI. I argue that a reformation is required in the way we evaluate AI systems and that we should look for inspiration to the cognitive sciences, which have a longstanding tradition of assessing general intelligence across diverse species. I identify some of the difficulties that need to be overcome when applying cognitively-inspired approaches to general-purpose AI systems and also analyse the emerging area of "Evals". The paper concludes by identifying promising research pathways that could refine AI evaluation, advancing it towards a rigorous scientific domain that contributes to the development of safe AI systems.
