General Scales Unlock AI Evaluation with Explanatory and Predictive Power

Lexin Zhou, Lorenzo Pacchiardi, Fernando Martínez-Plumed, Katherine M. Collins, Yael Moros-Daval, Seraphina Zhang, Qinlin Zhao, Yitian Huang, Luning Sun, Jonathan E. Prunty, Zongqian Li, Pablo Sánchez-García, Kexin Jiang Chen, Pablo A. M. Casares, Jiyun Zu, John Burden, Behzad Mehrbakhsh, David Stillwell, Manuel Cebrian, Jindong Wang, Peter Henderson, Sherry Tongshuang Wu, Patrick C. Kyllonen, Lucy Cheke, Xing Xie, José Hernández-Orallo

TL;DR

The paper introduces a construct-oriented AI evaluation framework designed to provide both explanatory and predictive power across a wide range of tasks. It defines 18 cognitive-demand rubrics (DeLeAn) plus Unguessability, totaling 19 dimensions, and assembles them into an automated, openly accessible ADeLe battery that can be annotated by LLMs to produce interpretable demand and ability profiles for individual AI systems. By plotting subject characteristic curves for each dimension and deriving 18-dimensional ability profiles, the method enables causal explanations of benchmark results and strong instance-level predictions, including out-of-distribution performance, via demand-based assessors. Empirically, the authors annotate 16,108 instances across 63 tasks from 20 benchmarks with 18 demands and one extraneous dimension, showing robust human-LLM agreement and meaningful separability across dimensions. The approach yields an in-distribution AUROC around 0.84 with excellent calibration, and maintains substantial predictive power under task- and benchmark-out-of-distribution scenarios, outperforming black-box baselines and demonstrating the value of interpretable, general-purpose scales for AI evaluation. The work also provides an open-source platform and a scalable methodology for extending the demand/ability framework to additional modalities, safety considerations, and regulatory deployment.

Abstract

Ensuring safe and effective use of AI requires understanding and anticipating its performance on novel tasks, from advanced scientific challenges to transformed workplace activities. So far, benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems, given the low transferability across diverse tasks. In this paper, we introduce general scales for AI evaluation that can explain what common AI benchmarks really measure, extract ability profiles of AI systems, and predict their performance for new task instances, in- and out-of-distribution. Our fully-automated methodology builds on 18 newly-crafted rubrics that place instance demands on general scales that do not saturate. Illustrated for 15 large language models and 63 tasks, high explanatory power is unleashed from inspecting the demand and ability profiles, bringing insights on the sensitivity and specificity exhibited by different benchmarks, and how knowledge, metacognition and reasoning are affected by model size, chain-of-thought and distillation. Surprisingly, high predictive power at the instance level becomes possible using these demand levels, providing superior estimates over black-box baseline predictors based on embeddings or finetuning, especially in out-of-distribution settings (new tasks and new benchmarks). The scales, rubrics, battery, techniques and results presented here represent a major step for AI evaluation, underpinning the reliable deployment of AI in the years ahead. (Collaborative platform: https://kinds-of-intelligence-cfi.github.io/ADELE.)
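The instance-level predictions described above are summarised in the TL;DR with AUROC, so a small worked example of that metric may help. The sketch below is illustrative only: the three-dimensional demand vectors, the outcomes, and the `toy_assessor_score` rule (harder items get lower scores) are hypothetical stand-ins, not the assessors or data used in the paper.

```python
def auroc(scores, labels):
    """Chance that a random success is scored above a random failure.

    Ties between a success and a failure count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one success and one failure")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def toy_assessor_score(demands):
    """Hypothetical scoring rule: the highest demand level drives difficulty."""
    return -max(demands)


# Hypothetical annotated instances: (demand levels on three dimensions, success).
instances = [
    ([1, 0, 2], 1),
    ([3, 1, 1], 1),
    ([2, 2, 4], 0),
    ([5, 3, 2], 0),
    ([2, 1, 3], 0),
    ([1, 1, 1], 1),
]

scores = [toy_assessor_score(d) for d, _ in instances]
labels = [y for _, y in instances]
toy_auroc = auroc(scores, labels)
print(f"AUROC of the toy assessor: {toy_auroc:.3f}")
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation of successes from failures. The metric is threshold-free, so it is usually read alongside a calibration measure.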

Paper Structure

This paper contains 56 sections, 2 equations, 31 figures, 18 tables.

Figures (31)

  • Figure 1: Processes to explain and predict performance for new systems and benchmarks. Top: "System Process": Steps for each new AI system: (1) Run the new system on the annotated-demand-levels (ADeLe) Battery, (2) Plot characteristic curves for all dimensions and extract the ability profile for the system, and, optionally, (3) Train a simple assessor using the annotated levels as inputs and the score as output. Bottom: "Task Process": Steps for each new task or benchmark: (A) Apply the demand-level-annotation (DeLeAn) rubrics to the new tasks using a standard LLM, (B) Get demand histograms and demand profiles that explain what demands the tasks require, and, optionally, (C) Predict performance for the new tasks for any system that has built an assessor after the "System Process". Assessors based on the demand profile have higher predictive power than other baseline assessors, especially in out-of-distribution settings, anticipating validity in novel situations.
  • Figure 2: Level annotations of five items (from benchmarks OmniMath, TimeQA, MedCalcBench, MMLU-Pro, TruthQuest, respectively) using the DeLeAn rubric set by GPT-4o. The listed demands in the table (from left to right) follow the same order shown in Table \ref{tab:dimensions} (from top to bottom): 11 being primordial, 5 knowledge and 3 extraneous.
  • Figure 3: The characteristic curve of Llama-3.1-405B-Instruct for dimension KNn (Knowledge of Natural Sciences) on the ADeLe battery. The $x$-axis shows the demand levels 0-5 for KNn and the $y$-axis the average performance for that level (probability of success). As usual, level 0 has no points left (0 never dominates), but in this case there is also no point for level 1. The curve is a logistic fit in the output range (0,1).
  • Figure 4: Correlations of the demand level using all the items in the ADeLe battery for all the pairs of the 18 demands and the special dimension UG (Unguessability). It also includes the success (i.e., correctness at instance-level) of all the subject LLMs considered in the experiments.
  • Figure 5: Distribution of level frequencies for the 18 demands using all the 16,108 instances in the ADeLe battery v.1.0. Dimensions such as CEe (Verbal Expression), MS (Mind modelling and social cognition) and SNs (Spatial Reasoning and Navigation - Spatial) have a low proportion of high-level items, but this is in accordance with the focus of LLM evaluation on factual questions with no navigation or full social interaction. Future versions of the battery for agents or multimodal scenarios can increase the number and breadth of the dimensions.
  • ...and 26 more figures
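
The characteristic curve in Figure 3 (average success per demand level, with a logistic fit) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `(level, success)` records are made-up toy data, the fit is plain gradient ascent on the log-likelihood rather than whatever fitting procedure the authors use, the sketch skips the paper's filtering of which instances count toward each level, and reading off the level where the curve crosses 0.5 is an assumed summary statistic, not necessarily the paper's definition of ability.

```python
import math
from collections import defaultdict


def success_rate_by_level(records):
    """Average success (0/1) for each annotated demand level."""
    totals = defaultdict(lambda: [0, 0])  # level -> [successes, count]
    for level, success in records:
        totals[level][0] += success
        totals[level][1] += 1
    return {lvl: s / n for lvl, (s, n) in sorted(totals.items())}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(records, steps=5000, lr=0.05):
    """Fit p(success | level) = sigmoid(a + b * level) by gradient ascent."""
    a, b = 0.0, 0.0
    n = len(records)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for level, success in records:
            err = success - sigmoid(a + b * level)
            grad_a += err
            grad_b += err * level
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


# Hypothetical results for one system on one dimension: 10 items per level.
successes_per_level = {1: 9, 2: 8, 3: 5, 4: 2, 5: 1}
records = [
    (level, 1 if i < k else 0)
    for level, k in successes_per_level.items()
    for i in range(10)
]

rates = success_rate_by_level(records)
a, b = fit_logistic(records)
crossing = -a / b  # demand level at which the fitted curve equals 0.5

for level, rate in rates.items():
    print(f"level {level}: observed {rate:.2f}, fitted {sigmoid(a + b * level):.2f}")
print(f"fitted curve crosses 0.5 near level {crossing:.2f}")
```

A negative slope `b` means success becomes less likely as demand rises, which is the shape one would expect for a dimension that the system actually finds harder at higher levels.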