General Scales Unlock AI Evaluation with Explanatory and Predictive Power
Lexin Zhou, Lorenzo Pacchiardi, Fernando Martínez-Plumed, Katherine M. Collins, Yael Moros-Daval, Seraphina Zhang, Qinlin Zhao, Yitian Huang, Luning Sun, Jonathan E. Prunty, Zongqian Li, Pablo Sánchez-García, Kexin Jiang Chen, Pablo A. M. Casares, Jiyun Zu, John Burden, Behzad Mehrbakhsh, David Stillwell, Manuel Cebrian, Jindong Wang, Peter Henderson, Sherry Tongshuang Wu, Patrick C. Kyllonen, Lucy Cheke, Xing Xie, José Hernández-Orallo
TL;DR
The paper introduces a construct-oriented AI evaluation framework designed to provide both explanatory and predictive power across a wide range of tasks. It defines 18 cognitive-demand rubrics (DeLeAn) plus Unguessability, totaling 19 dimensions, and assembles them into an automated, openly accessible ADeLe battery that can be annotated by LLMs to produce interpretable demand and ability profiles for individual AI systems. By plotting subject characteristic curves for each dimension and deriving 18-dimensional ability profiles, the method enables causal explanations of benchmark results and strong instance-level predictions, including out-of-distribution performance, via demand-based assessors. Empirically, the authors annotate 16,108 instances across 63 tasks from 20 benchmarks with 18 demands and one extraneous dimension, showing robust human-LLM agreement and meaningful separability across dimensions. The approach yields an in-distribution AUROC around 0.84 with excellent calibration, and maintains substantial predictive power under task- and benchmark-out-of-distribution scenarios, outperforming black-box baselines and demonstrating the value of interpretable, general-purpose scales for AI evaluation. The work also provides an open-source platform and a scalable methodology for extending the demand/ability framework to additional modalities, safety considerations, and regulatory deployment.
Abstract
Ensuring safe and effective use of AI requires understanding and anticipating its performance on novel tasks, from advanced scientific challenges to transformed workplace activities. So far, benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems, given the low transferability across diverse tasks. In this paper, we introduce general scales for AI evaluation that can explain what common AI benchmarks really measure, extract ability profiles of AI systems, and predict their performance for new task instances, in- and out-of-distribution. Our fully-automated methodology builds on 18 newly-crafted rubrics that place instance demands on general scales that do not saturate. Illustrated for 15 large language models and 63 tasks, high explanatory power is unleashed from inspecting the demand and ability profiles, bringing insights on the sensitivity and specificity exhibited by different benchmarks, and how knowledge, metacognition and reasoning are affected by model size, chain-of-thought and distillation. Surprisingly, high predictive power at the instance level becomes possible using these demand levels, providing superior estimates over black-box baseline predictors based on embeddings or finetuning, especially in out-of-distribution settings (new tasks and new benchmarks). The scales, rubrics, battery, techniques and results presented here represent a major step for AI evaluation, underpinning the reliable deployment of AI in the years ahead. (Collaborative platform: https://kinds-of-intelligence-cfi.github.io/ADELE.)
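To make the idea of reading an ability off annotated demand levels more concrete, the following is a minimal sketch for a single dimension. It is not the authors' implementation: the function names, the toy data, the plain gradient-descent logistic fit, and the choice to summarise ability as the demand level where the fitted curve crosses 50% success are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def success_rate_by_level(levels, outcomes):
    """Empirical success rate of one system at each annotated demand level."""
    totals, wins = defaultdict(int), defaultdict(int)
    for level, ok in zip(levels, outcomes):
        totals[level] += 1
        wins[level] += int(ok)
    return {level: wins[level] / totals[level] for level in sorted(totals)}


def fit_logistic_curve(levels, outcomes, lr=0.1, steps=5000):
    """Fit P(success) = sigmoid(slope * level + intercept) by gradient descent.

    This is an assumed, simplified stand-in for a characteristic-curve fit.
    """
    slope, intercept = 0.0, 0.0
    n = len(levels)
    for _ in range(steps):
        grad_slope = grad_intercept = 0.0
        for x, y in zip(levels, outcomes):
            p = 1.0 / (1.0 + math.exp(-(slope * x + intercept)))
            grad_slope += (p - y) * x
            grad_intercept += p - y
        slope -= lr * grad_slope / n
        intercept -= lr * grad_intercept / n
    return slope, intercept


# Hypothetical toy data: successes out of 10 instances at demand levels 0..5.
successes_per_level = {0: 10, 1: 9, 2: 8, 3: 5, 4: 2, 5: 1}
levels, outcomes = [], []
for level, wins in successes_per_level.items():
    levels += [level] * 10
    outcomes += [1] * wins + [0] * (10 - wins)

rates = success_rate_by_level(levels, outcomes)
slope, intercept = fit_logistic_curve(levels, outcomes)

# Assumed ability summary: demand level at which fitted success probability is 0.5.
crossing = -intercept / slope
print(rates)
print(f"slope={slope:.3f}, 50% crossing at demand level {crossing:.2f}")
```

Repeating this per dimension would give one number per scale, which is one plausible way to picture an ability profile expressed in the same units as the demand annotations.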
