Using Counterfactual Tasks to Evaluate the Generality of Analogical Reasoning in Large Language Models

Martha Lewis, Melanie Mitchell

TL;DR

This study tests whether large language models (LLMs) perform genuine, general analogical reasoning or instead rely on similarity to their training data. To do so, it introduces counterfactual variants of letter-string analogy problems, built from permuted alphabets and symbol sets, and compares humans with three GPT models (GPT-3, GPT-3.5, GPT-4) on both the original and the counterfactual tasks. Humans maintain high accuracy on both task types, whereas the GPT models' performance drops substantially on the counterfactual tasks, indicating that LLM analogy-making lacks robust generality. The findings challenge claims of human-like abstract reasoning in current LLMs and highlight the need for counterfactual evaluation when assessing generalization in artificial reasoning systems.

Abstract

Large language models (LLMs) have performed well on several reasoning benchmarks, including ones that test analogical reasoning abilities. However, it has been debated whether they are actually performing humanlike abstract reasoning or instead employing less general processes that rely on similarity to what has been seen in their training data. Here we investigate the generality of analogy-making abilities previously claimed for LLMs (Webb, Holyoak, & Lu, 2023). We take one set of analogy problems used to evaluate LLMs and create a set of "counterfactual" variants: versions that test the same abstract reasoning abilities but that are likely dissimilar from any pre-training data. We test humans and three GPT models on both the original and counterfactual problems, and show that, while the performance of humans remains high for all the problems, the GPT models' performance declines sharply on the counterfactual set. This work provides evidence that, despite previously reported successes of LLMs on analogical reasoning, these models lack the robustness and generality of human analogy-making.
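
To make the idea of a counterfactual variant concrete, the following is a minimal sketch of how a letter-string problem could be rebuilt over a permuted or symbolic alphabet. It is an illustration under stated assumptions, not the authors' generation code: the function names, the prompt wording, the full random shuffle, and the choice of a "successor" transformation are all hypothetical.

```python
import random
import string


def permuted_alphabet(seed=0):
    """Return the lowercase alphabet in a random order (hypothetical scheme)."""
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    rng.shuffle(letters)
    return letters


def successor_pair(alphabet, start, length=4):
    """Return a run of `length` items and the same run with its last item
    replaced by that item's successor in `alphabet`."""
    if start < 0 or start + length >= len(alphabet):
        raise ValueError("run plus successor must fit inside the alphabet")
    run = list(alphabet[start:start + length])
    changed = run[:-1] + [alphabet[start + length]]
    return run, changed


def make_problem(alphabet, src_start, tgt_start, length=4):
    """Build a prompt and its expected answer over any ordered alphabet.

    The prompt layout is an assumption made for illustration only.
    """
    src, src_changed = successor_pair(alphabet, src_start, length)
    tgt, tgt_changed = successor_pair(alphabet, tgt_start, length)
    prompt = (
        f"Use this fictional alphabet: [{' '.join(alphabet)}]\n"
        f"[{' '.join(src)}] [{' '.join(src_changed)}]\n"
        f"[{' '.join(tgt)}] [ ? ]"
    )
    return prompt, tgt_changed


if __name__ == "__main__":
    # The same abstract rule applied over the standard alphabet,
    # a permuted alphabet, and an arbitrary symbol set.
    standard = list(string.ascii_lowercase)
    symbols = list("*@%!^#~$&+")
    for alphabet in (standard, permuted_alphabet(seed=1), symbols):
        prompt, answer = make_problem(alphabet, 0, 5)
        print(prompt)
        print("expected:", " ".join(answer), "\n")
```

Because the rule is defined relative to whatever ordering is supplied, the abstract relation stays fixed while the surface form moves away from the familiar alphabet.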

Paper Structure

This paper contains 19 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Example analogy problem with permuted alphabet, in format seen by human participants.
  • Figure 2: Example analogy problem with symbolic alphabet.
  • Figure 3: Example attention check.
  • Figure 4: Human performance across problem types in the zero-generalization setting on unpermuted alphabets (ours, orange; Webb et al., blue). Data points represent the average of 46 samples for our data, and 57 samples for Webb et al.'s. Bars give 95% binomial confidence intervals.
  • Figure 5: Comparison of GPT computational results with Webb et al. (2023) in the zero-generalization setting. Points represent accuracy and bars represent 95% binomial confidence intervals. Each data point represents the average of 70 samples for our data and 100 samples for Webb et al.'s data.
  • ...and 2 more figures
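
The captions for Figures 4 and 5 report 95% binomial confidence intervals but do not say which interval method was used. As a hedged illustration, the sketch below computes a Wilson score interval for an accuracy estimate. The choice of the Wilson method, the function name, and the example counts are assumptions, not details taken from the paper.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% interval. The Wilson method
    is an assumption here; the source does not name its interval method.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    )
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # Hypothetical counts: 56 correct answers out of 70 samples.
    low, high = wilson_interval(56, 70)
    print(f"accuracy = {56 / 70:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

With sample sizes such as the 46 to 100 per data point mentioned in the captions, intervals of this kind are fairly wide, which is worth keeping in mind when comparing nearby accuracy values.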