Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks

Melanie Mitchell, Alessandro B. Palmarini, Arseny Moskvichev

TL;DR

This study rigorously evaluates abstract reasoning in GPT-4 and GPT-4V using the ConceptARC benchmark. Using a more detailed one-shot prompt for the text-only model and image versions of the simplest tasks for the multimodal model, the authors show that GPT-4 improves modestly over simpler zero-shot prompting but remains far from humanlike abstraction, and that GPT-4V performs even worse on the image-based tasks. The results challenge claims of emergent, robust abstract reasoning in current LLMs and highlight significant gaps in cross-modal generalization. The work suggests that alternative representations or prompting strategies may be necessary to close the gap between AI systems and humans on abstraction tasks.

Abstract

We explore the abstract reasoning abilities of text-only and multimodal versions of GPT-4, using the ConceptARC benchmark [10], which is designed to evaluate robust understanding and reasoning with core-knowledge concepts. We extend the work of Moskvichev et al. [10] by evaluating GPT-4 on more detailed, one-shot prompting (rather than simple, zero-shot prompts) with text versions of ConceptARC tasks, and by evaluating GPT-4V, the multimodal version of GPT-4, on zero- and one-shot prompts using image versions of the simplest tasks. Our experimental results support the conclusion that neither version of GPT-4 has developed robust abstraction abilities at humanlike levels.

Paper Structure

This paper contains 9 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Examples of ARC tasks from ARC-Github. Each task has a set of demonstration input-output pairs that illustrate an abstract grid-transformation rule, and a test input. The solver's challenge is to generate a new grid that results from applying the abstract rule to the test input. (Figure is from Moskvichev et al. [10]; best viewed in color.)
  • Figure 2: (a) A task from the ConceptARC corpus. (b) The corresponding prompt that Moskvichev et al. [10] gave to GPT-4. (Image is from Moskvichev et al. [10]; best viewed in color.)
  • Figure 3: Example of the prompt used to test text-only GPT-4 on ConceptARC tasks. The symbol "#" indicates comments not given in the actual prompt.
  • Figure 4: Example of the prompt used to test GPT-4V in the one-shot setting. The symbol "#" indicates comments not given in the actual prompt. Red text signifies differences from the prompt used to test text-only GPT-4, as provided in Figure 3.
  • Figure 5: Example of the prompts used to test GPT-4V in the zero-shot setting. The symbol "#" indicates comments not given in the actual prompt.
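
The captions above describe an ARC task as a set of demonstration input-output grid pairs plus a test input, presented to the model either zero-shot or with one solved example. A minimal Python sketch of that structure follows. It is an illustration only: the names ArcTask, grid_to_text, task_to_text, and build_prompt are hypothetical, and the instruction wording and grid layout are assumptions, not the prompt text shown in Figures 3 to 5.

```python
from dataclasses import dataclass
from typing import Optional

Grid = list[list[int]]  # each cell holds an integer color code


@dataclass
class ArcTask:
    """Hypothetical container for one ARC-style task."""
    demonstrations: list[tuple[Grid, Grid]]  # (input grid, output grid) pairs
    test_input: Grid


def grid_to_text(grid: Grid) -> str:
    """Serialize a grid as one line per row, cells separated by spaces."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def task_to_text(task: ArcTask) -> str:
    """Render the demonstrations followed by the test input."""
    parts = []
    for i, (inp, out) in enumerate(task.demonstrations, start=1):
        parts.append(f"Demonstration {i} input:\n{grid_to_text(inp)}")
        parts.append(f"Demonstration {i} output:\n{grid_to_text(out)}")
    parts.append(f"Test input:\n{grid_to_text(task.test_input)}")
    return "\n\n".join(parts)


def build_prompt(task: ArcTask,
                 worked_example: Optional[tuple[ArcTask, Grid]] = None) -> str:
    """Zero-shot when worked_example is None, one-shot otherwise.

    The instruction sentence below is an assumed placeholder, not the
    wording used in the paper.
    """
    sections = ["Infer the rule from the demonstrations and give the test output grid."]
    if worked_example is not None:
        example_task, example_answer = worked_example
        sections.append(
            "Solved example:\n\n" + task_to_text(example_task)
            + "\n\nTest output:\n" + grid_to_text(example_answer)
        )
    sections.append("Task to solve:\n\n" + task_to_text(task) + "\n\nTest output:")
    return "\n\n".join(sections)


# Toy tasks (invented for illustration): the rule swaps colors 0 and 1.
solved = ArcTask(demonstrations=[([[0, 1]], [[1, 0]])], test_input=[[1, 1]])
target = ArcTask(
    demonstrations=[([[0, 1], [1, 0]], [[1, 0], [0, 1]])],
    test_input=[[1, 1], [0, 0]],
)

zero_shot = build_prompt(target)
one_shot = build_prompt(target, worked_example=(solved, [[0, 0]]))
```

The one-shot prompt differs from the zero-shot one only by the extra solved task placed before the task to be solved, which mirrors the zero-shot versus one-shot distinction drawn in the abstract.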