Response: Emergent analogical reasoning in large language models
Damian Hodel, Jevin West
TL;DR
The paper challenges the claim that GPT-3 exhibits emergent zero-shot analogical reasoning by presenting counterexamples in letter-string analogy tasks. It shows GPT-3's performance deteriorates on simple variants and under a synthetic alphabet, while humans remain robust, suggesting brittleness and possible reliance on memorized data rather than genuine generalization. The authors argue that zero-shot reasoning requires evidence that training data do not contain the problems or solutions, which is difficult to verify, and that human-centric tests may not transfer to LLM capabilities. They call for more rigorous, memorization-aware evaluation methods to accurately assess true reasoning in large language models and to curb anthropomorphized interpretations of model behavior.
Abstract
In their recent Nature Human Behaviour paper, "Emergent analogical reasoning in large language models," Webb, Holyoak, and Lu (2023) argue that "large language models such as GPT-3 have acquired an emergent ability to find zero-shot solutions to a broad range of analogy problems." In this response, we provide counterexamples using the letter-string analogies. In our tests, GPT-3 fails to solve even the simplest variations of the original tasks, whereas human performance remains consistently high across all modified versions. Zero-shot reasoning is an extraordinary claim that requires extraordinary evidence. We do not see that evidence in our experiments. To strengthen claims of humanlike reasoning such as zero-shot reasoning, it is important that the field develop approaches that rule out data memorization.
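To make the kind of task concrete, the following is a minimal sketch of how a letter-string analogy under a synthetic alphabet might be generated. It is illustrative only: the function names, the bracketed prompt layout, the "successor" transformation, and the use of a seeded shuffle to build the synthetic alphabet are all assumptions made here, not the authors' actual prompts or materials.

```python
import random
import string


def make_alphabet(seed=None):
    """Return the standard alphabet, or a shuffled ("synthetic") one if a seed is given.

    Hypothetical helper: a seeded permutation is just one way to build an
    alphabet whose ordering is unlikely to appear in training data.
    """
    letters = list(string.ascii_lowercase)
    if seed is not None:
        random.Random(seed).shuffle(letters)
    return letters


def successor_problem(alphabet, start, length=4):
    """Build a pair like [a b c d] -> [a b c e]: the last letter is replaced
    by the letter that follows it in the given alphabet ordering."""
    if start < 0 or start + length >= len(alphabet):
        raise ValueError("start + length must stay inside the alphabet")
    source = alphabet[start:start + length]
    target = source[:-1] + [alphabet[start + length]]
    return source, target


def format_prompt(alphabet, src_start, query_start, length=4):
    """Return (prompt, expected_answer) for one analogy problem.

    The prompt layout is an assumption chosen for readability.
    """
    src, tgt = successor_problem(alphabet, src_start, length)
    query, answer = successor_problem(alphabet, query_start, length)
    prompt = (
        f"Use this alphabet: [{' '.join(alphabet)}]\n"
        f"[{' '.join(src)}] [{' '.join(tgt)}]\n"
        f"[{' '.join(query)}] ["
    )
    return prompt, " ".join(answer)
```

Calling `format_prompt(make_alphabet(), 0, 8)` produces the familiar problem `[a b c d] [a b c e]` / `[i j k l] [` with expected answer `i j k m`. Passing `make_alphabet(seed=...)` instead yields a problem with the same structure over a permuted ordering, so a correct answer cannot be retrieved from the memorized standard alphabet.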
