Response: Emergent analogical reasoning in large language models

Damian Hodel; Jevin West

TL;DR

The paper challenges the claim that GPT-3 exhibits emergent zero-shot analogical reasoning by presenting counterexamples in letter-string analogy tasks. It shows GPT-3's performance deteriorates on simple variants and under a synthetic alphabet, while humans remain robust, suggesting brittleness and possible reliance on memorized data rather than genuine generalization. The authors argue that zero-shot reasoning requires evidence that training data do not contain the problems or solutions, which is difficult to verify, and that human-centric tests may not transfer to LLM capabilities. They call for more rigorous, memorization-aware evaluation methods to accurately assess true reasoning in large language models and to curb anthropomorphized interpretations of model behavior.

Abstract

In their recent Nature Human Behaviour paper, "Emergent analogical reasoning in large language models" (Webb, Holyoak, and Lu, 2023), the authors argue that "large language models such as GPT-3 have acquired an emergent ability to find zero-shot solutions to a broad range of analogy problems." In this response, we provide counterexamples for the letter string analogies. In our tests, GPT-3 fails to solve even the simplest variations of the original tasks, whereas human performance remains consistently high across all modified versions. Zero-shot reasoning is an extraordinary claim that requires extraordinary evidence. We do not see that evidence in our experiments. To strengthen claims of humanlike reasoning such as zero-shot reasoning, it is important that the field develop approaches that rule out data memorization.

Paper Structure (14 sections, 7 figures)

Introduction
Criticism of the Methods Employed in the Original Paper
Conclusion
Code and data availability
Author contributions
Competing interests
Appendix
Counterexamples
Methods
GPT-3 evaluation
Human behavioral experiment
Results
Discussion
ChatGPT's answer to our question: "Could you give an example of a copycat problem?"

Figures (7)

Figure 1: Letter string analogies and their transformations, for both the original paper and our counterexamples. We introduce a synthetic alphabet into the task and apply two types of letter sequence modifications, both based on increasing the interval from one to two letters. For the transformation types 'extend sequence', 'successor', and 'predecessor', the modification only affects the letter to be changed (the last or first letter). For 'remove redundant letter', 'fix alphabetic sequence', and 'sort', the interval is increased for the complete letter sequence. We apply the same modifications to the problems generated with the synthetic alphabet.
Figure 2: Comparison between GPT-3's (blue) and human (orange) performances on modified letter string problems involving a synthetic alphabet and a larger interval size. The transformation types and their order correspond to Figure 6b in the original paper. Humans demonstrate significantly higher accuracy compared to GPT-3. Human results represent the average performance of 121 participants (UW undergraduates). Each participant received one randomly selected instance of each problem subtype. GPT-3 results reflect the average performance across all 50 instances. Gray error bars indicate 95% binomial confidence intervals for the average performance across multiple problems.
Figure 3: GPT-3 performance on zero-generalization letter string problems for the original experiment (blue), with the larger interval size (green), and with the larger interval size and the synthetic alphabet (orange). Except for 'remove redundant letter,' GPT-3's accuracy declines significantly for the modified problems. The results reflect the average performance across N=50 instances.
Figure 4: Human performance on zero-generalization letter string problems for the original experiment (blue), with the larger interval size (green), and with the larger interval size and the synthetic alphabet (orange). Human accuracy on the modified problems is comparable to that on the original problems (blue). The results reflect the average performance of N=121 participants (UW undergraduates).
Figure 5: Counterfactual comprehension check. Comparison of GPT-3 performance on zero-generalization letter string problems between the original tasks (blue) and two only marginally modified versions: tasks with a synthetic alphabet but an unmodified interval size (green), and tasks with a modified prompt but unmodified letter strings (orange). The accuracy on the modified tasks is lower than on the original ones but remains greater than 0.2, except for 'remove redundant letter' and 'sort' with the synthetic alphabet. The figure and the order of the transformation types correspond to Figure 6b in the original paper. These results reflect the average performance across N=50 instances.
...and 2 more figures
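
The two modifications described in the Figure 1 caption (a larger interval and a synthetic alphabet) can be illustrated with a minimal Python sketch for the 'successor' transformation. This is not the authors' code: the function names, the bracketed output format, and the seeded shuffle used as a stand-in for the synthetic alphabet are all assumptions made for illustration, and the synthetic alphabet actually used in the experiments is not reproduced here.

```python
import random
import string


def make_synthetic_alphabet(seed=0):
    """Return a shuffled copy of a-z.

    Assumption: a seeded shuffle stands in for the synthetic alphabet;
    the alphabet used in the experiments is not specified here.
    """
    letters = list(string.ascii_lowercase)
    random.Random(seed).shuffle(letters)
    return letters


def successor_problem(alphabet, start, length=4, interval=1):
    """Build one 'successor' source/target pair (hypothetical helper).

    The source is `length` consecutive letters of `alphabet` starting at
    index `start`. The target replaces only the last letter with the
    letter `interval` positions further along the alphabet, so
    interval=1 mimics the original task and interval=2 the modified one.
    """
    last = start + length - 1
    if start < 0 or last + interval >= len(alphabet):
        raise ValueError("sequence does not fit in the alphabet")
    source = alphabet[start:start + length]
    target = source[:-1] + [alphabet[last + interval]]
    return source, target


def format_problem(source, target):
    """Render a pair in an illustrative bracketed format (an assumption)."""
    return "[{}] -> [{}]".format(" ".join(source), " ".join(target))


if __name__ == "__main__":
    standard = list(string.ascii_lowercase)
    # Standard alphabet, interval of two: prints "[a b c d] -> [a b c f]"
    print(format_problem(*successor_problem(standard, 0, interval=2)))
    # Same rule applied to the shuffled stand-in alphabet
    synthetic = make_synthetic_alphabet(seed=0)
    print(format_problem(*successor_problem(synthetic, 0, interval=2)))
```

Passing the alphabet as an argument keeps the transformation rule identical across conditions, so only the letter ordering or the interval differs between an original and a modified problem.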
