The Cognitive Capabilities of Generative AI: A Comparative Analysis with Human Benchmarks

Isaac R. Galatzer-Levy, David Munday, Jed McGiffin, Xin Liu, Danny Karmon, Ilia Labzovsky, Rivka Moroshko, Amir Zait, Daniel McDuff

TL;DR

This study benchmarks state-of-the-art generative AI models against human norms using the WAIS-IV to quantify verbal, working memory, and perceptual reasoning abilities. By text-prompting WAIS-IV subtests and computing index-level scores, the authors compare language-only and multimodal models across Verbal Comprehension, Working Memory, and Perceptual Reasoning. The key finding is that models excel in Verbal Comprehension and Working Memory but show dramatic deficits in Perceptual Reasoning, with notable variation across model generations and architectures; small or older models lag behind larger, more tuned systems. The work demonstrates both the potential and the limits of current GenAI as cognitive systems, underscoring the need for domain-specific multimodal architectures and careful interpretation when benchmarking against human cognitive standards.

Abstract

There is increasing interest in tracking the capabilities of general intelligence foundation models. This study benchmarks leading large language models and vision language models against human performance on the Wechsler Adult Intelligence Scale (WAIS-IV), a comprehensive, population-normed assessment of underlying human cognition and intellectual abilities, with a focus on the domains of Verbal Comprehension (VCI), Working Memory (WMI), and Perceptual Reasoning (PRI). Most models demonstrated exceptional capabilities in the storage, retrieval, and manipulation of tokens such as arbitrary sequences of letters and numbers, with performance on the Working Memory Index (WMI) greater than or equal to the 99.5th percentile when compared to human population normative ability. Performance on the Verbal Comprehension Index (VCI), which measures retrieval of acquired information and linguistic understanding of the meaning of words and their relationships to each other, was also consistently at or above the 98th percentile. Despite these broad strengths, we observed consistently poor performance on the Perceptual Reasoning Index (PRI; range: 0.1st to 10th percentile) from multimodal models, indicating a profound inability to interpret and reason about visual information. Smaller and older model versions consistently performed worse, indicating that training data, parameter count, and advances in tuning are driving significant gains in cognitive ability.
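To relate the percentiles quoted above to the more familiar index-score scale, the following is a minimal sketch, not the authors' scoring procedure. It assumes only that WAIS-IV index scores are standardized to a mean of 100 and a standard deviation of 15 in the normative population, and that the distribution is approximately normal. Actual WAIS-IV scoring goes through published norm tables. The function names `percentile_to_index_score` and `index_score_to_percentile` are hypothetical helpers introduced here for illustration.

```python
from statistics import NormalDist

# Assumption: index scores follow a normal distribution with mean 100 and SD 15.
# Real WAIS-IV scoring uses published norm tables, so this is only an approximation.
INDEX_SCALE = NormalDist(mu=100, sigma=15)


def percentile_to_index_score(percentile: float) -> float:
    """Convert a population percentile (strictly between 0 and 100) to an index score."""
    if not 0 < percentile < 100:
        raise ValueError("percentile must be strictly between 0 and 100")
    return INDEX_SCALE.inv_cdf(percentile / 100)


def index_score_to_percentile(score: float) -> float:
    """Convert an index score back to a population percentile."""
    return 100 * INDEX_SCALE.cdf(score)


if __name__ == "__main__":
    # Percentile bounds as reported in the abstract.
    reported = [
        ("WMI (at or above)", 99.5),
        ("VCI (at or above)", 98.0),
        ("PRI (lower end)", 0.1),
        ("PRI (upper end)", 10.0),
    ]
    for label, pct in reported:
        score = percentile_to_index_score(pct)
        print(f"{label}: percentile {pct} ~ index score {score:.0f}")
```

Under this normal approximation, the reported WMI and VCI thresholds correspond to index scores of roughly 139 and 131, while the PRI range spans roughly 54 to 81. That puts the gap between the verbal or working-memory results and the perceptual results at more than three standard deviations.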

Paper Structure

This paper contains 12 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Benchmarking Models Against Human Performance on the Wechsler Adult Intelligence Scale. We perform a comprehensive, population-normed assessment of AI models against underlying human cognition and intellectual abilities, with a focus on the domains of Verbal Comprehension (VCI), Working Memory (WMI), and Perceptual Reasoning (PRI).
  • Figure 2: Example of the Matrix Reasoning test, in which subjects are asked to identify a pattern across the rows and columns of the matrix. In this example, the first row is composed of a green square followed by a red triangle (from left to right). The second row also starts with a green square, meaning that the correct option for the cell marked with a question mark is a red triangle. Answer: 2.
  • Figure 3: Example of the Visual Puzzles test, in which subjects are asked to identify all the pieces that compose the provided design. In this example, the provided design is a three-color rectangle. It can be composed by arranging rectangle (2), rectangle (1), and a flipped rectangle (4), from left to right. Answer: 1,2,4.
  • Figure 4: Example of the Figure Weights test, in which subjects are asked to determine the correct way to balance the scales by finding the missing pieces. In this example, the scale on the left holds a single red circle, while the scale on the right holds two red circles. Since each red circle carries the same weight, balancing the scales requires adding one red circle to the scale on the left. Answer: 1.