From Human-Level AI Tales to AI Leveling Human Scales

Peter Romero; Fernando Martínez-Plumed; Zachary R. Tyler; Matthieu Téhénan; Sipeng Chen; Álvaro David Gómez Antón; Luning Sun; Manuel Cebrian; Lexin Zhou; Yael Moros Daval; Daniel Romero-Alvarado; Félix Martí Pérez; Kevin Wei; José Hernández-Orallo

TL;DR

This work builds on a set of multi-level scales for different capabilities, where each level is meant to represent a probability of success for the whole world population on a logarithmic scale with base $B$, and it evaluates the quality of different mappings using group slicing and post-stratification.

Abstract

Comparing AI models to "human level" is often misleading when benchmark scores are incommensurate or human baselines are drawn from a narrow population. To address this, we propose a framework that calibrates items against the "world population" and reports performance on a common, human-anchored scale. Concretely, we build on a set of multi-level scales for different capabilities, where each level should represent a probability of success for the whole world population on a logarithmic scale with base $B$. We calibrate each scale for each capability (reasoning, comprehension, knowledge, volume, etc.) by compiling publicly released human test data spanning education and reasoning benchmarks (PISA, TIMSS, ICAR, UKBioBank, and ReliabilityBench). The base $B$ is estimated by extrapolating between samples with two demographic profiles using LLMs, under the hypothesis that LLMs condense rich information about human populations. We evaluate the quality of different mappings using group slicing and post-stratification. The new techniques allow scales to be recalibrated and standardized relative to the whole-world population.
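The relation between levels and success probabilities, and the post-stratification step named in the abstract, can be illustrated with a minimal sketch. It assumes that level $L$ on a base-$B$ scale corresponds to a success probability of $B^{-L}$, and that post-stratification is the standard reweighting of per-group success rates by population shares. The function names and the example groups are hypothetical and are not taken from the paper.

```python
import math


def level_to_success_probability(level: float, base: float) -> float:
    """Assumed mapping: level L on a base-B scale ~ 1-in-B**L people succeed."""
    return base ** (-level)


def success_probability_to_level(p: float, base: float) -> float:
    """Inverse of the assumed mapping: L = -log_B(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if base <= 1.0:
        raise ValueError("base must be greater than 1")
    return -math.log(p) / math.log(base)


def post_stratify(group_success: dict, population_share: dict) -> float:
    """Reweight per-group success rates by (possibly unnormalized) population shares.

    This is textbook post-stratification, used here as an assumption about
    what the abstract refers to; the group labels are hypothetical.
    """
    total = sum(population_share.values())
    if total <= 0:
        raise ValueError("population shares must sum to a positive value")
    weighted = sum(group_success[g] * w for g, w in population_share.items())
    return weighted / total


if __name__ == "__main__":
    # Hypothetical sample: group "a" is over-represented among test takers,
    # but makes up only a quarter of the target population.
    rates = {"a": 0.8, "b": 0.2}
    shares = {"a": 1.0, "b": 3.0}
    p_world = post_stratify(rates, shares)
    print(p_world, success_probability_to_level(p_world, base=10.0))
```

Under these assumptions, an item that the sample solves often can still land at a higher level once the success rate is reweighted toward the target population.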

Paper Structure (38 sections, 1 equation, 10 figures, 16 tables)

Introduction
Related Work
Methodology
Item pools and observed human performance
Instance-level demand annotation with ADeLe
Instance-based calibration with LLMs
Validation
Experiments
Data
Experimental Setup
Results and Analysis
Validation: Comparing Ground Truth with Extrapolation
Calibration: From Theoretical Annotations to Empirical Scaling
Scale Base Calibration
Conclusion
...and 23 more sections

Figures (10)

Figure 1: Calibrated annotations of benchmarks can be used to generate profiles of AI systems on human-referenced scales (top). In this paper we calibrate 18 dimensions of capability and knowledge, going from level 0 (near-universal success) to level 5 (approximately 1 in $B^5$ people succeeding), with $B$ normalized according to the human distribution taken from several tests with human results (bottom). The calibration passes this source human sample and its demographics through an LLM to extrapolate to the whole world population.
Figure 2: Human-calibrated bases $B$ for groups of dimensions. For each plot we use only the examples for which any of the dimensions of that group is dominant, and we use the harmonic mean when the group contains more than one dimension. The $x$-axis shows the level as annotated following the ADeLe rubrics, and the $y$-axis shows the corresponding level derived from the LLM estimate. By fitting a linear function to the means of the levels ($\star$) we can derive the human-calibrated base for each group. Many bases are smaller than 10, the value assumed in the original scales, which suggests that this calibration should be applied when comparing values across dimensions. Mind Modeling & Social Cognition has a negative slope, probably because it has the smallest number of examples, which prevents a good estimate at levels 3 and 4.
Figure 3: Mapping of the dominant ADeLe dimension from extrapolated calibrations (what the model thinks about how the world would perform) against real-world outcomes of test takers from ICAR Logical Reasoning.
Figure 4: Mapping of the dominant ADeLe dimension from extrapolated calibrations (what the model thinks about how the world would perform) against real-world outcomes of test takers from ICAR Verbal Reasoning.
Figure 5: Mapping of the dominant ADeLe dimension from extrapolated calibrations (what the model thinks about how the world would perform) against real-world outcomes of test takers from PISA results.
...and 5 more figures
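
The Figure 2 caption describes deriving a human-calibrated base from a linear fit of LLM-estimated levels against ADeLe-annotated levels. One way to read that step is sketched below. It assumes the estimated levels are expressed on the original base-10 scale, so that a fitted slope $s$ implies a calibrated base of $10^{s}$; this conversion is an interpretation, not a formula stated in the source, and the function names and data points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept (pure Python)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def calibrated_base(annotated_levels, estimated_levels, reference_base=10.0):
    """Assumed conversion: one annotated level spans `slope` reference levels,

    so success drops by a factor of reference_base ** slope per annotated level.
    A slope below 1 gives a base below the reference base; a negative slope
    gives a base below 1, which signals an unreliable fit.
    """
    slope, _ = fit_line(annotated_levels, estimated_levels)
    return reference_base ** slope


if __name__ == "__main__":
    # Hypothetical per-level means (the star markers in a plot like Figure 2).
    annotated = [0, 1, 2, 3, 4]
    estimated = [0.0, 0.5, 1.0, 1.5, 2.0]
    print(calibrated_base(annotated, estimated))
```

With these made-up points the slope is 0.5, so the implied base is the square root of 10, which matches the caption's observation that calibrated bases can fall below 10.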
