Evaluating the Robustness of Analogical Reasoning in Large Language Models

Martha Lewis, Melanie Mitchell

TL;DR

An investigation of the robustness of analogy-making abilities previously claimed for LLMs, covering three of the four domains studied by Webb, Holyoak, and Lu (2023), finds that the performance of GPT models, unlike that of humans, is susceptible to answer-order effects, and that GPT models may also be more sensitive than humans to paraphrasing.

Abstract

LLMs have performed well on several reasoning benchmarks, including ones that test analogical reasoning abilities. However, there is debate on the extent to which they are performing general abstract reasoning versus employing non-robust processes, e.g., processes that rely too heavily on similarity to pre-training data. Here we investigate the robustness of analogy-making abilities previously claimed for LLMs on three of the four domains studied by Webb, Holyoak, and Lu (2023): letter-string analogies, digit matrices, and story analogies. For each domain we test humans and GPT models on robustness to variants of the original analogy problems that test the same abstract reasoning abilities but are likely dissimilar from tasks in the pre-training data. The performance of a system that uses robust abstract reasoning should not decline substantially on these variants. On simple letter-string analogies, we find that while the performance of humans remains high for the two types of variants we tested, the GPT models' performance declines sharply. This pattern is less pronounced as the complexity of these problems is increased, as both humans and GPT models perform poorly on both the original and variant problems requiring more complex analogies. On digit-matrix problems, we find a similar pattern, but only on one of the two types of variants we tested. On story-based analogy problems, we find that the performance of GPT models, unlike that of humans, is susceptible to answer-order effects, and that GPT models may also be more sensitive than humans to paraphrasing. This work provides evidence that LLMs often lack the robustness of zero-shot human analogy-making, exhibiting brittleness on most of the variations we tested. More generally, this work points to the importance of carefully evaluating AI systems not only for accuracy but also for robustness when testing their cognitive capabilities.

Paper Structure

This paper contains 57 sections, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Example items in human study.
  • Figure 2: Human results on WHL's original letter-string task for problems with zero to three generalizations. "WHL" refers to the results of WHL's original human studies, and "Ours" refers to the results of our human studies. Data points give mean accuracy across all transformation types, and bars indicate 95% binomial confidence intervals. Numbers of samples for WHL are 342 for each number of generalizations. Numbers of samples for our data are 276 for zero generalizations, 138 for one, and 92 each for two and three generalizations.
  • Figure 3: GPT results on WHL's original letter-string task for problems with zero to three generalizations. "GPT-3 WHL" refers to WHL's results on GPT-3. "GPT-3", "GPT-3.5", and "GPT-4" refer to our results with those models. Data points give mean accuracy across all task types, and bars indicate 95% binomial confidence intervals. Numbers of samples for GPT-3 WHL are 300 for each number of generalizations. Numbers of samples for our data are 420 for zero and one generalization, and 490 for each of two and three generalizations.
  • Figure 4: Accuracy on different alphabet types, across different numbers of generalizations, for human participants and GPT models. Data points indicate mean accuracy across all transformation types for different alphabets, and bars indicate 95% binomial confidence intervals. The number of samples for each human data point is given in Table \ref{tab:letterstring_num_samples}. The number of samples for each GPT-model data point is as follows. For zero or one generalizations and 0-20 letters permuted, each data point corresponds to 420 samples. For two or three generalizations and 0-20 letters permuted, each data point corresponds to 490 samples. For the symbol alphabets, zero-generalization data points correspond to 40 samples and one-generalization data points correspond to 420 samples.
  • Figure 5: Example items in human study.
  • ...and 8 more figures
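
The captions above report mean accuracy with 95% binomial confidence intervals, but do not say which interval method was used. As a minimal sketch, assuming the Wilson score interval (one common choice, not necessarily the authors'), an interval for an accuracy estimate could be computed as follows. The function name `wilson_interval` and the count of correct answers in the usage example are hypothetical; only the sample size of 342 is taken from the Figure 2 caption.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    ASSUMPTION: the source does not state which binomial interval it uses;
    the Wilson interval is shown here purely for illustration.
    z = 1.96 is the normal quantile for a two-sided 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")

    p_hat = successes / n                      # observed accuracy
    denom = 1.0 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    )
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # n = 342 matches the per-condition sample size quoted for WHL's human
    # data; the number of correct answers (280) is a made-up illustration.
    low, high = wilson_interval(successes=280, n=342)
    print(f"accuracy = {280 / 342:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

With a fixed accuracy, the interval narrows as the sample size grows, which is why the captions list the number of samples behind each data point.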