From Human-Level AI Tales to AI Leveling Human Scales

Peter Romero, Fernando Martínez-Plumed, Zachary R. Tyler, Matthieu Téhénan, Sipeng Chen, Álvaro David Gómez Antón, Luning Sun, Manuel Cebrian, Lexin Zhou, Yael Moros Daval, Daniel Romero-Alvarado, Félix Martí Pérez, Kevin Wei, José Hernández-Orallo

TL;DR

This work builds on a set of multi-level capability scales in which each level represents a probability of success across the whole world population, on a logarithmic scale with base $B$, and evaluates the quality of different mappings using group slicing and post-stratification.

Abstract

Comparing AI models to "human level" is often misleading when benchmark scores are incommensurate or human baselines are drawn from a narrow population. To address this, we propose a framework that calibrates items against the 'world population' and reports performance on a common, human-anchored scale. Concretely, we build on a set of multi-level scales for different capabilities, where each level should represent a probability of success of the whole world population on a logarithmic scale with base $B$. We calibrate each scale for each capability (reasoning, comprehension, knowledge, volume, etc.) by compiling publicly released human test data spanning education and reasoning benchmarks (PISA, TIMSS, ICAR, UKBioBank, and ReliabilityBench). The base $B$ is estimated by extrapolating between samples with two demographic profiles using LLMs, under the hypothesis that LLMs condense rich information about human populations. We evaluate the quality of different mappings using group slicing and post-stratification. The new techniques allow for the recalibration and standardization of scales relative to the whole-world population.
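As a minimal sketch of the scale described above: a level $L$ on a base-$B$ logarithmic scale corresponds to a success probability of roughly $B^{-L}$, and a post-stratified success rate reweights group-level rates by each group's share of the target population. The function names, group labels, and numbers below are hypothetical and chosen only for illustration; they are not taken from the paper's data or code.

```python
import math


def level_to_probability(level: float, base: float) -> float:
    """Success probability implied by a level: p = base ** (-level)."""
    return base ** (-level)


def probability_to_level(p: float, base: float) -> float:
    """Inverse mapping: level = -log_base(p), defined for 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p) / math.log(base)


def post_stratified_rate(group_rates: dict, population_shares: dict) -> float:
    """Weight each group's observed success rate by its population share.

    Shares are normalised here, so they do not need to sum to 1.
    """
    total_share = sum(population_shares.values())
    weighted = sum(group_rates[g] * population_shares[g] for g in population_shares)
    return weighted / total_share


# Hypothetical example: the test-taking sample over-represents group "a",
# so its raw success rate would overstate world-population success.
sample_rates = {"a": 0.5, "b": 0.1}       # assumed per-group success rates
world_shares = {"a": 0.25, "b": 0.75}     # assumed world-population shares

world_rate = post_stratified_rate(sample_rates, world_shares)
item_level = probability_to_level(world_rate, base=10.0)
```

Under these assumptions, `item_level` is the position of the item on a base-10 scale; substituting a calibrated base $B$ for `10.0` rescales the same probability onto the calibrated scale.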

Paper Structure (38 sections, 1 equation, 10 figures, 16 tables)

Figures (10)

  • Figure 1: Calibrated annotations of benchmarks can be used to generate profiles of AI systems on human-referenced scales (top). In this paper we calibrate 18 dimensions of capability and knowledge, ranging from level 0 (near-universal success) to level 5 ($\approx$ 1 in $B^5$ people succeeding), with $B$ normalized according to the human distribution taken from several tests with human results (bottom). The calibration passes this source human sample and its demographics through an LLM to extrapolate to the whole world population.
  • Figure 2: Human-calibrated bases $B$ for groups of dimensions. For each plot we use only the examples in which one of the group's dimensions is dominant, and take the harmonic mean when the group contains more than one dimension. The $x$-axis shows the level, as annotated following the ADeLe rubrics, and the $y$-axis shows the corresponding level derived from the LLM estimate. By fitting a linear function to the means of the levels ($\star$) we can derive the human-calibrated base for each group. Many bases are smaller than 10, the value assumed in the original scales, suggesting that this calibration should be applied when comparing values across dimensions. Mind Modeling & Social Cognition has a negative slope, probably because it has the smallest number of examples at levels 3 and 4, too few for a good estimate.
  • Figure 3: Mapping of the dominant ADeLe dimension from extrapolated calibrations (what the model thinks about how the world would perform) against real-world outcomes of test takers from ICAR Logical Reasoning.
  • Figure 4: Mapping of the dominant ADeLe dimension from extrapolated calibrations (what the model thinks about how the world would perform) against real-world outcomes of test takers from ICAR Verbal Reasoning.
  • Figure 5: Mapping of the dominant ADeLe dimension from extrapolated calibrations (what the model thinks about how the world would perform) against real-world outcomes of test takers from PISA results.
  • ...and 5 more figures
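
The linear fit mentioned in the Figure 2 caption can be illustrated with a short sketch. It assumes that the LLM-estimated levels on the $y$-axis are expressed on the original base-10 scale, so that a fitted slope $s$ implies $p \approx 10^{-sL} = (10^{s})^{-L}$ and hence a calibrated base $B = 10^{s}$. That conversion, the function names, and the sample numbers are assumptions made for illustration; the paper's exact procedure may differ.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def calibrated_base(slope: float, reference_base: float = 10.0) -> float:
    """Assumed conversion from fitted slope to base: B = reference_base ** slope."""
    return reference_base ** slope


# Hypothetical data: annotated levels and the mean LLM-estimated level
# at each annotated level (the per-level means marked in the plot).
annotated_levels = [0, 1, 2, 3, 4, 5]
mean_estimated_levels = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]

slope, intercept = fit_line(annotated_levels, mean_estimated_levels)
base = calibrated_base(slope)
```

With these made-up means the slope is 0.5, giving a base below 10. Under the same assumed conversion, a negative slope would yield a base below 1, which cannot be read as a meaningful difficulty scale and is consistent with the caption treating that case as an estimation artifact.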