Evaluating AI Evaluation: Perils and Prospects

John Burden

TL;DR

The paper argues that current AI evaluation is perilous due to over-reliance on narrow benchmarks with weak construct validity, which undermines predictive power for deployment in diverse, real-world contexts. It proposes a cognitively-inspired shift toward capability-oriented evaluation, formalising task spaces with $M$, instances $\mu$, and performance $\psi$, and emphasising predictive validity via $\Psi(\pi, M) = \mathbb{E}_{\mu \sim p_M}[\psi(\pi, \mu)]$. The work surveys psychometrics, SEM, IRT, and developmental psychology as sources for measuring latent capabilities, criticises Evals and benchmark-centric practices, and argues for three forward-looking trajectories: cultural change, mechanistic interpretability, and capability-oriented evaluation. Its contributions include a formal framework for task-instance evaluation, a critique of current benchmarks (e.g., HELM Classic) and Evals, and a roadmap for integrating cognitive-science methods into AI safety evaluation. The significance lies in steering AI evaluation toward a rigorous, theory-grounded science that can better predict, constrain, and safely govern increasingly capable AI systems across broad, real-world tasks.
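To make the aggregate-performance formula in the TL;DR concrete, the following is a minimal sketch, not the paper's implementation. It assumes a finite set of instances with known sampling weights standing in for $p_M$, and a toy threshold "system"; all function and variable names (`exact_aggregate_performance`, `toy_score`, and so on) are hypothetical and chosen only for illustration.

```python
import random
from typing import Callable, Sequence, TypeVar

System = TypeVar("System")
Instance = TypeVar("Instance")


def exact_aggregate_performance(
    system: System,
    instances: Sequence[Instance],
    weights: Sequence[float],
    score: Callable[[System, Instance], float],
) -> float:
    """Weighted mean of score(system, mu): the expectation over a finite p_M."""
    total = sum(weights)
    return sum(w / total * score(system, mu) for mu, w in zip(instances, weights))


def sampled_aggregate_performance(
    system: System,
    instances: Sequence[Instance],
    weights: Sequence[float],
    score: Callable[[System, Instance], float],
    n_samples: int = 10_000,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of the same expectation, drawing mu ~ p_M."""
    rng = random.Random(seed)
    sampled = rng.choices(instances, weights=weights, k=n_samples)
    return sum(score(system, mu) for mu in sampled) / n_samples


# Toy setup (assumption): instances are difficulty levels 1..5, and the
# "system" is just a skill threshold that succeeds when difficulty <= skill.
toy_instances = [1, 2, 3, 4, 5]
toy_weights = [0.4, 0.3, 0.2, 0.05, 0.05]  # stand-in for p_M
toy_system = 3  # skill level


def toy_score(skill: int, difficulty: int) -> float:
    """Per-instance performance: 1.0 on success, 0.0 on failure."""
    return 1.0 if difficulty <= skill else 0.0


exact = exact_aggregate_performance(toy_system, toy_instances, toy_weights, toy_score)
estimate = sampled_aggregate_performance(toy_system, toy_instances, toy_weights, toy_score)
```

In this toy setup the aggregate score is high because the instance distribution is dominated by easy instances. Changing `toy_weights` changes the aggregate without changing the system, which illustrates why a single benchmark number can predict poorly for a deployment distribution that differs from the benchmark's.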

Abstract

As AI systems appear to exhibit ever-increasing capability and generality, assessing their true potential and safety becomes paramount. This paper contends that the prevalent evaluation methods for these systems are fundamentally inadequate, heightening the risks and potential hazards associated with AI. I argue that a reformation is required in the way we evaluate AI systems and that we should look for inspiration to the cognitive sciences, which have a longstanding tradition of assessing general intelligence across diverse species. I identify some of the difficulties that need to be overcome when applying cognitively-inspired approaches to general-purpose AI systems and also analyse the emerging area of "Evals". The paper concludes by identifying promising research pathways that could refine AI evaluation, advancing it towards a rigorous scientific domain that contributes to the development of safe AI systems.

Paper Structure (21 sections, 2 equations)

Introduction
Formalising Tasks, Instances, and Performance
Capability-oriented Evaluation and Performance-oriented Evaluation
The Fallacy of Reification?
Evaluation Is For Prediction
Risks From Poor Evaluation
Evaluation of AI systems in Practice
Case Study: HELM Classic
Benchmark Blindness
The Problem With Evals
Evaluating Systems That (May) Have General Intelligence
Difficulties of Evaluating General Intelligences
Avoiding the Biomorphism of AI Systems
Limits of Norm-referenced Testing for AI
Evaluation Doesn't Occur in a Vacuum
...and 6 more sections
