On Benchmarking Human-Like Intelligence in Machines

Lance Ying, Katherine M. Collins, Lionel Wong, Ilia Sucholutsky, Ryan Liu, Adrian Weller, Tianmin Shu, Thomas L. Griffiths, Joshua B. Tenenbaum

TL;DR

The paper critiques current AI benchmarks for lacking human-validated labels, insufficiently capturing human variability and uncertainty, and lacking ecological validity. Through a human-data study on ten benchmarks, it reveals biases and misalignments between ground-truth labels and human judgments. It then offers five recommendations to advance benchmarking: use human ground-truth data and robust samples, evaluate against population-level distributions with soft labels, measure graded uncertainty, ground tasks in cognitive theory, and prioritize ecologically valid, cognitively rich tasks. Together, these proposals aim to yield more rigorous, generalizable assessments of human-like intelligence in AI with implications for alignment and safe, effective human-AI collaboration.

Abstract

Recent benchmark studies have claimed that AI has approached or even surpassed human-level performance on various cognitive tasks. However, this position paper argues that current AI evaluation paradigms are insufficient for assessing human-like cognitive capabilities. We identify a set of key shortcomings: a lack of human-validated labels, inadequate representation of human response variability and uncertainty, and reliance on simplified and ecologically invalid tasks. We support our claims with a human evaluation study on ten existing AI benchmarks, which suggests significant biases and flaws in task and label design. To address these limitations, we propose five concrete recommendations for developing future benchmarks that will enable more rigorous and meaningful evaluations of human-like cognitive capacities in AI, with implications for the applications that depend on such capacities.

Paper Structure

This paper contains 27 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Distribution of participants' agreement with benchmark labels across all 300 stimuli. For 26.67% of the stimuli, agreement with the label is below 50% (i.e., fewer than half of the participants selected the label provided by the benchmark).
  • Figure 2: Distribution of participants' ratings on one of the stimuli. The ground truth label is "unsupportive".
  • Figure 3: Distribution of participants' ratings on soft labels across all 300 stimuli. Each rating maps onto a ground-truth label of 0 or 100, except 625 ratings where the underlying label is 50 (Neutral).
  • Figure 4: The food truck experiment used by Baker et al. (2017) to study human social reasoning. In this domain, a participant watches an agent moving to get food from a food truck. There are three kinds of food trucks: Lebanese (L), Mexican (M) and Korean (K). The agent cannot see which food truck is behind the wall unless they walk behind it to check. After observing the agent's trajectory, the participant is asked to rate, on a Likert scale, the agent's preference among the food trucks and the agent's belief about which food truck is behind the wall. The results show graded judgments in humans across different agent trajectories.
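
The agreement statistic in the Figure 1 caption and the soft labels mentioned in the TL;DR can be illustrated with a minimal sketch. This is not the authors' analysis code: the function names, the data layout (a list of dicts with `label` and `responses` keys), and the toy responses are all assumptions invented for illustration. It shows one plausible way to compute per-stimulus agreement with a benchmark label, the share of stimuli where fewer than half of participants chose that label, and an empirical response distribution that could serve as a population-level soft label.

```python
from collections import Counter


def agreement_rate(responses, benchmark_label):
    """Fraction of participants whose response equals the benchmark label."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(r == benchmark_label for r in responses) / len(responses)


def soft_label(responses):
    """Empirical distribution over the responses given to one stimulus."""
    if not responses:
        raise ValueError("need at least one response")
    counts = Counter(responses)
    total = len(responses)
    return {label: n / total for label, n in counts.items()}


def share_below_majority(stimuli):
    """Share of stimuli where fewer than half of participants chose the label."""
    rates = [agreement_rate(s["responses"], s["label"]) for s in stimuli]
    return sum(rate < 0.5 for rate in rates) / len(rates)


# Toy data, invented for illustration only (not taken from the study).
toy_stimuli = [
    {"label": "unsupportive",
     "responses": ["unsupportive", "neutral", "supportive", "neutral"]},
    {"label": "supportive",
     "responses": ["supportive", "supportive", "neutral", "supportive"]},
    {"label": "neutral",
     "responses": ["neutral", "neutral", "supportive", "unsupportive"]},
]

if __name__ == "__main__":
    for stimulus in toy_stimuli:
        print(stimulus["label"], soft_label(stimulus["responses"]))
    print("share below 50% agreement:", share_below_majority(toy_stimuli))
```

Under this definition, a stimulus with exactly 50% agreement is not counted as falling below the threshold, which matches the "less than 50%" wording of the caption. For scale, 80 of 300 stimuli would round to the reported 26.67%.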