In-Context Learning May Not Elicit Trustworthy Reasoning: A-Not-B Errors in Pretrained Language Models

Pengrui Han; Peiyang Song; Haofei Yu; Jiaxuan You

TL;DR

State-of-the-art LLMs (such as Llama3-8b) perform consistently well with in-context learning (ICL), yet their accuracy on reasoning tasks drops significantly when the context changes trivially. This suggests that, in this regard, LLMs have inhibitory control abilities only on par with those of human infants.

Abstract

Recent advancements in artificial intelligence have led to the creation of highly capable large language models (LLMs) that can perform tasks in a human-like manner. However, LLMs exhibit only infant-level cognitive abilities in certain areas. One such area is the A-Not-B error, a phenomenon seen in infants, who repeat a previously rewarded behavior even though they have clearly observed that conditions have changed. This highlights infants' lack of inhibitory control, the ability to stop a habitual or impulsive response. In our work, we design a text-based multi-choice QA scenario similar to the A-Not-B experimental settings to systematically test the inhibitory control abilities of LLMs. We find that state-of-the-art LLMs (like Llama3-8b) perform consistently well with in-context learning (ICL) but make errors and show a significant drop of as much as 83.3% in reasoning tasks when the context changes trivially. This suggests that LLMs only have inhibitory control abilities on par with human infants in this regard, often failing to suppress the previously established response pattern during ICL.
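The text-based multi-choice setup described in the abstract can be illustrated with a minimal sketch. It assumes that every few-shot example is arranged so its correct option is A, while the test question's correct option is deliberately not A. The names `MCQ`, `move_answer_to`, `format_item`, and `build_a_not_b_prompt`, as well as the prompt layout, are hypothetical and are not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import List

LETTERS = "ABCDEFGHIJ"


@dataclass
class MCQ:
    question: str
    options: List[str]   # option texts in display order
    answer_index: int    # index of the correct option


def move_answer_to(item: MCQ, target_index: int) -> MCQ:
    """Return a copy whose correct option sits at target_index (via a swap)."""
    options = list(item.options)
    i = item.answer_index
    options[i], options[target_index] = options[target_index], options[i]
    return MCQ(item.question, options, target_index)


def format_item(item: MCQ, with_answer: bool) -> str:
    """Render one question; leave the answer blank for the test item."""
    lines = [f"Question: {item.question}"]
    lines += [f"{LETTERS[k]}. {opt}" for k, opt in enumerate(item.options)]
    lines.append(f"Answer: {LETTERS[item.answer_index]}" if with_answer else "Answer:")
    return "\n".join(lines)


def build_a_not_b_prompt(shots: List[MCQ], test_item: MCQ) -> str:
    """Assumed A-Not-B layout: all demonstrations answer A, the test item does not."""
    demos = [move_answer_to(s, 0) for s in shots]
    if test_item.answer_index == 0:
        # Assumption: push the correct option off A so the habit is wrong.
        test_item = move_answer_to(test_item, 1)
    blocks = [format_item(d, True) for d in demos]
    blocks.append(format_item(test_item, False))
    return "\n\n".join(blocks)
```

A model that simply continues the established pattern would answer A on the final question, which is the analogue of the infant searching in Box A. Comparing accuracy on such prompts against prompts whose demonstrations have varied answers gives the kind of accuracy drop the abstract reports.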

Paper Structure (31 sections, 11 figures, 3 tables)

Introduction
Related Work
A-Not-B Error and Inhibitory Control.
LLMs and Inhibitory Control.
LLMs and Multiple-Choice QA.
Experimental Setup
Experiment Motivation
Models
Data
MCQA Datasets
Preprocessing
Prompting Settings
Original prompting
A-Not-B prompting
Results
...and 16 more sections

Figures (11)

Figure 1: A-Not-B Style Few-Shot Prompts Mislead Gemini on Elementary School-Level Questions. This figure presents a simple few-shot prompt that tricks an advanced model, Gemini, on elementary school-level questions by consistently providing examples with Answer A. This example was collected on Sep 21, 2024; future model updates may lead to different results. We provide a screenshot in the appendix.
Figure 2: Illustration of A-not-B Task Performance in Infants and LLMs. This figure demonstrates the typical A-not-B error using a binary (Boxes A and B) setup. The top sequence shows an infant's repeated actions: seeing an object placed in Box A, observing it moved to Box B, yet continuing to search in Box A. This depicts the cognitive phenomenon in which prior experience overrides current explicit visual cues. The bottom sequence presents the analogous A-not-B prompting scenario for LLMs, in which a model is misled by a consistent answer pattern and fails to adapt to minimally changed circumstances.
Figure 3: Increasing Few-Shot Examples and Smaller Model Sizes Lead to Greater Accuracy Decline. This figure shows the performance impact of increasing few-shot examples on large and small models. Negative values represent a decline in accuracy due to adversarial prompts. (1) More shots generally lead to a greater decline in performance, especially beyond 10 shots. (2) Smaller models (Llama3-8b and Qwen1.5-7b) are consistently more affected than larger models (Llama3-70b and Qwen1.5-72b), indicating higher susceptibility to the A-Not-B scenario.
Figure 4: Different Reasoning Tasks Show Susceptibility to A-Not-B Errors to Varying Degrees. This figure illustrates how different reasoning tasks experience performance drops under the A-Not-B scenario as the number of shots increases. (1) Arithmetic reasoning (MathQA) shows a significant decline in performance across all shots, indicating a high susceptibility. (2) Commonsense reasoning (CommonsenseQA) also exhibits A-Not-B errors consistently across all shots but is less impacted than arithmetic reasoning. (3) Causal reasoning (Winogrande) begins to show significant performance drops at 25 shots. (4) Scientific reasoning (SciQ) displays minimal fluctuations and only shows a slightly more noticeable impact at 25 shots.
Figure 5: Persistent Vulnerability of LLMs to A-not-B Errors Even in Many-Shot Scenarios. This figure shows the performance of large (70B) and small (8B) Llama3 models in the many-shot arithmetic reasoning task for A-not-B errors. Despite more options and examples to reduce cognitive biases, both models, especially the small one, still exhibit significant accuracy declines, highlighting LLMs' internal vulnerability to A-not-B errors.
...and 6 more figures
