Dissecting the Ullman Variations with a SCALPEL: Why do LLMs fail at Trivial Alterations to the False Belief Task?

Zhiqiang Pi, Annapurna Vadaparty, Benjamin K. Bergen, Cameron R. Jones

TL;DR

The paper investigates why LLMs falter on Theory of Mind tasks under adversarial stimulus changes, using SCALPEL to generate minimal, hypothesis-driven prompt modifications to the Transparent-Access variation of the Unexpected Contents Task. It shows that stating bridging inferences explicitly (such as that a see-through container lets a character recognize its contents, or that reading a label leads to looking inside) significantly improves GPT-4's accuracy, to approximately $90\%$, while GPT-3.5 remains near chance. The results challenge the view that LLMs rely solely on superficial cues and suggest partial, non-robust ToM abilities that depend on explicit bridging inferences. SCALPEL is proposed as a practical, extensible toolkit for dissecting specific cognitive inferences in LLMs and guiding future human-model comparisons.

Abstract

Recent empirical results have sparked a debate about whether or not Large Language Models (LLMs) are capable of Theory of Mind (ToM). While some have found LLMs to be successful on ToM evaluations such as the False Belief task, others have shown that their performance is not robust against trivial alterations to stimuli. In this paper, we introduce SCALPEL -- a technique to incrementally modify stimuli to test different specific hypotheses about why LLMs fail -- and apply this method to the "transparent-access" modification of the unexpected contents task. Our results suggest that LLMs often do poorly because they fail to make essential common-sense inferences, such as that seeing a transparent container implies recognizing its contents. We conclude that while modern LLMs go beyond mere pattern matching, they still fall short of robust human-like ToM. We argue that SCALPEL can help cognitive scientists examine LLMs' capabilities in finer detail and provide insight into alternative mechanisms by which tasks that are used to assess human cognition might be completed.

Paper Structure (11 sections, 2 figures, 1 table)

This paper contains 11 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Accuracy rates of GPT-3.5 and GPT-4 on the original Transparent-Access modification of the Unexpected Contents task and on additional modifications that include connecting inferences. The high accuracy achieved by GPT-4 on the recognize_content modification, together with the small improvements from the read_look, look_read, recognize_label, and visualize modifications, suggests that the model fails to infer that characters recognize the content of the transparent container when they look at it.
  • Figure 2: Each scenario was tested 20 times per model for each modification. Each cell in the heatmap shows the accuracy of the corresponding model on one item.
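
The per-cell accuracy described for Figure 2 can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name `cell_accuracy`, the tuple layout, the scenario label, and the toy data are all assumptions made for the example. Only the model names, the modification name, and the 20-trials-per-cell design come from the captions above.

```python
from collections import defaultdict


def cell_accuracy(trials):
    """Aggregate trial outcomes into per-cell accuracy.

    `trials` is an iterable of (model, modification, scenario, correct)
    tuples; this layout is a hypothetical choice for illustration.
    Returns a dict mapping (model, modification, scenario) to the
    fraction of correct trials in that cell.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [n_correct, n_total]
    for model, modification, scenario, correct in trials:
        cell = counts[(model, modification, scenario)]
        cell[0] += int(correct)
        cell[1] += 1
    return {key: n_correct / n_total for key, (n_correct, n_total) in counts.items()}


# Toy data (made up, not results from the paper): 20 trials for one cell,
# of which the first 18 are marked correct.
toy_trials = [
    ("gpt-4", "recognize_content", "scenario_1", i < 18) for i in range(20)
]
accuracy = cell_accuracy(toy_trials)
```

Averaging these cell values over scenarios for a given model and modification would give the kind of per-modification accuracy rate plotted in Figure 1, assuming every cell has the same number of trials.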