Dissecting the Ullman Variations with a SCALPEL: Why do LLMs fail at Trivial Alterations to the False Belief Task?
Zhiqiang Pi, Annapurna Vadaparty, Benjamin K. Bergen, Cameron R. Jones
TL;DR
The paper investigates why LLMs falter on Theory of Mind (ToM) tasks when stimuli are trivially altered. It uses SCALPEL to generate minimal, hypothesis-driven prompt modifications to the Transparent-Access variation of the Unexpected Contents Task. Stating key inferences explicitly in the prompt (for example, that a 'see-through' container lets an observer recognize its contents, or that reading a label leads to looking inside) significantly improves GPT-4's accuracy, to approximately 90%, while GPT-3.5 remains near chance. The results challenge the view that LLMs rely solely on superficial cues and suggest partial, non-robust ToM abilities that depend on explicit bridging inferences. SCALPEL is proposed as a practical, extensible toolkit for dissecting specific cognitive inferences in LLMs and guiding future human-model comparisons.
Abstract
Recent empirical results have sparked a debate about whether or not Large Language Models (LLMs) are capable of Theory of Mind (ToM). While some have found LLMs to be successful on ToM evaluations such as the False Belief task, others have shown that their performance is not robust against trivial alterations to stimuli. In this paper, we introduce SCALPEL -- a technique to incrementally modify stimuli to test different specific hypotheses about why LLMs fail -- and apply this method to the "transparent-access" modification of the unexpected contents task. Our results suggest that LLMs often do poorly because they fail to make essential common-sense inferences, such as that seeing a transparent container implies recognizing its contents. We conclude that while modern LLMs go beyond mere pattern matching, they still fall short of robust human-like ToM. We argue that SCALPEL can help cognitive scientists examine LLMs' capabilities in finer detail and provide insight into alternative mechanisms by which tasks that are used to assess human cognition might be completed.
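To make the idea of incremental, hypothesis-driven modification concrete, the following is a minimal sketch rather than the authors' implementation. The stimulus wording, the hypothesis labels, and the names `Modification`, `apply_modification`, and `build_variants` are all hypothetical and chosen only for illustration. Each variant differs from the baseline by one inserted sentence that spells out a single candidate inference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Modification:
    """One minimal edit tied to one hypothesis about why a model fails."""
    hypothesis: str       # hypothetical label for the inference being tested
    after_sentence: int   # index of the base sentence to insert after
    inserted_text: str    # sentence that makes the inference explicit


# Hypothetical transparent-access stimulus (illustrative wording only).
BASE_STIMULUS = [
    "Here is a bag filled with popcorn.",
    "The bag is made of transparent plastic.",
    "The label on the bag says 'chocolate'.",
    "Sam finds the bag and reads the label.",
]

QUESTION = "What does Sam believe the bag contains?"

# Hypothetical modifications, one per candidate missing inference.
MODIFICATIONS = [
    Modification(
        hypothesis="transparency_implies_recognition",
        after_sentence=1,
        inserted_text="Sam can see through the bag and recognizes the popcorn inside.",
    ),
    Modification(
        hypothesis="reading_label_implies_looking",
        after_sentence=3,
        inserted_text="While reading the label, Sam also looks at what is inside the bag.",
    ),
]


def apply_modification(base, mod):
    """Return the stimulus text with exactly one sentence inserted."""
    sentences = list(base)  # copy so the baseline is never mutated
    sentences.insert(mod.after_sentence + 1, mod.inserted_text)
    return " ".join(sentences)


def build_variants(base, mods):
    """Map each hypothesis label to its minimally modified stimulus."""
    variants = {"baseline": " ".join(base)}
    for mod in mods:
        variants[mod.hypothesis] = apply_modification(base, mod)
    return variants


if __name__ == "__main__":
    for name, text in build_variants(BASE_STIMULUS, MODIFICATIONS).items():
        print(f"[{name}]\n{text} {QUESTION}\n")
```

Because every variant changes only one thing relative to the baseline, a difference in model accuracy between the baseline and a variant can be attributed to the specific inference that variant makes explicit.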
