Evidence from counterfactual tasks supports emergent analogical reasoning in large language models

Taylor Webb; Keith J. Holyoak; Hongjing Lu

TL;DR

Large language models exhibit zero-shot analogical reasoning capabilities, but counterfactual task critiques question their generality. The authors demonstrate that permuted-alphabet tasks can be solved by GPT-4 when code execution enables precise counting, arguing that these results reflect genuine reasoning rather than memorized data. They link emergent analogical reasoning to structured relational representations and in-context learning, while providing open datasets and materials to support replication. Overall, the work clarifies the conditions under which LLMs generalize analogical reasoning and outlines directions for probing underlying cognitive-like mechanisms.

Abstract

We recently reported evidence that large language models are capable of solving a wide range of text-based analogy problems in a zero-shot manner, indicating the presence of an emergent capacity for analogical reasoning. Two recent commentaries have challenged these results, citing evidence from so-called `counterfactual' tasks in which the standard sequence of the alphabet is arbitrarily permuted so as to decrease similarity with materials that may have been present in the language model's training data. Here, we reply to these critiques, clarifying some misunderstandings about the test materials used in our original work, and presenting evidence that language models are also capable of generalizing to these new counterfactual task variants.
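The counterfactual construction described above can be illustrated with a minimal sketch. It assumes a problem format in which the final letter of a string advances by a fixed interval within a shuffled alphabet; the function names (`permuted_alphabet`, `successor_problem`, `solve`), the seed, and the string length are hypothetical choices for illustration, not the authors' materials.

```python
import random
import string


def permuted_alphabet(seed):
    """Return the 26 lowercase letters in a reproducible shuffled order."""
    letters = list(string.ascii_lowercase)
    random.Random(seed).shuffle(letters)
    return letters


def successor_problem(alphabet, start, length=4, interval=1):
    """Build (source, transformed): the final letter advances by `interval`
    positions in the permuted alphabet (assumed problem format)."""
    source = alphabet[start:start + length]
    last_pos = alphabet.index(source[-1])
    transformed = source[:-1] + [alphabet[last_pos + interval]]
    return source, transformed


def solve(alphabet, source, transformed, target):
    """Infer the shift applied to the last letter and apply it to the target.
    Looking up positions explicitly mirrors the counting step that code
    execution makes precise."""
    shift = alphabet.index(transformed[-1]) - alphabet.index(source[-1])
    target_pos = alphabet.index(target[-1])
    return target[:-1] + [alphabet[target_pos + shift]]


alphabet = permuted_alphabet(seed=0)  # hypothetical seed
src, dst = successor_problem(alphabet, start=2, interval=2)
tgt, expected = successor_problem(alphabet, start=10, interval=2)
answer = solve(alphabet, src, dst, tgt)

print("alphabet:", " ".join(alphabet))
print(" ".join(src), "->", " ".join(dst))
print(" ".join(tgt), "->", " ".join(answer))
```

Because every position lookup goes through the shuffled list, the standard alphabetical order gives no help in this sketch, which is the property the counterfactual variants are meant to enforce.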

Paper Structure (16 sections, 4 figures, 2 tables)

Original test materials
Counterfactual tasks
Mechanisms underlying emergent analogical reasoning
Methods
Code
Problem Set
Evaluating GPT-4
Evaluating GPT-4 with code execution
Human Behavioral Experiment
Statistical analyses
Supplementary Results
Data Availability
Example GPT-4 response involving code execution
Results by problem type
Evaluating the impact of training data cutoff date
...and 1 more sections

Figures (4)

Figure 1: Results for letter-string analogies with shuffled alphabet. (a) Example problem from Hodel & West. Letter-string analogies are constructed based on a permuted alphabet. This example involves a successor relation with an interval size of 1, applied to the final letter of the string. Other problems involved an interval size of 2. See Methods (Supplementary Section S1) for more details. (b) Results for human participants, GPT-4, and a variant of GPT-4 augmented with the capacity to write and execute code (which the model used to identify the positions of letters in the permuted alphabet). Both humans and GPT-4 showed greater difficulty on problems involving an interval of size 2 (main effect of interval size, human participants: $P=1.5\times10^{-11}$, GPT-4: $P<2\times10^{-16}$). This effect did not reach significance for GPT-4 + code execution ($P=0.066$). Human participants outperformed GPT-4 (main effect of human participants vs. GPT-4: $P=8.9\times10^{-13}$), but GPT-4 + code execution performed on par with human participants (main effect of human participants vs. GPT-4 + code execution: $P=0.496$). When correct, GPT-4's responses were also accompanied by accurate explanations of the underlying rule, and incorrect responses were often based on a valid alternative rule (see Supplementary Results). Human results reflect average performance for $N=99$ participants for interval-size-1 and $N=97$ separate participants for interval-size-2. Black error bars represent standard error of the mean across participants. Grey error bars represent 95% binomial CIs for average performance across multiple problems.
Figure 1: Results for letter-string analogies with shuffled alphabet, sorted by transformation type and interval size. (a) Results for human participants ($N=99$), GPT-4, and GPT-4 + code execution on problems with an interval size of 1. (b) Results for human participants ($N=97$), GPT-4, and GPT-4 + code execution on problems with an interval size of 2. Black error bars represent standard error of the mean across participants. Grey error bars represent 95% binomial CIs for average performance across multiple problems.
Figure 2: Results for GPT-4 + code execution on the original vs. new synthetic alphabets. (a) Results for problems with an interval size of 1. (b) Results for problems with an interval size of 2. Performance was comparable for the two alphabets (logistic regression with binary predictor coding for old vs. new alphabet: $P=0.52$). Error bars represent 95% binomial CIs for average performance across multiple problems.
Figure 3: Results for GPT-4 using engines with different training data cutoff dates. (a) Results for problems with an interval size of 1. (b) Results for problems with an interval size of 2. Performance was comparable for the two engines (logistic regression with binary predictor coding for old vs. new engine: $P=0.12$). Error bars represent 95% binomial CIs for average performance across multiple problems.
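The captions above report 95% binomial CIs for accuracy across problems. The sketch below shows one common way to compute such an interval, the Wilson score interval. The captions do not say which binomial method was used, so the choice of Wilson is an assumption, and the counts in the example (45 correct out of 60) are made-up illustrative values, not results from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to a two-sided 95% interval. The Wilson method is
    an assumed choice; other binomial intervals (e.g. Clopper-Pearson)
    would give slightly different bounds.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2))
    ) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical counts for illustration only.
low, high = wilson_interval(successes=45, trials=60)
print(f"accuracy = {45 / 60:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

The Wilson interval stays within [0, 1] and behaves sensibly when accuracy is near 0 or 1, which is why it is often preferred over the simple normal approximation for per-condition accuracy.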
