SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs

Yuling Gu, Oyvind Tafjord, Hyunwoo Kim, Jared Moore, Ronan Le Bras, Peter Clark, Yejin Choi

TL;DR

This work introduces SimpleToM, a dataset of concise, diverse stories and the first to systematically explore downstream reasoning requiring knowledge of mental states in realistic scenarios. It shows that models can be coaxed to perform well, but only with task-specific interventions.

Abstract

While prior work has explored whether large language models (LLMs) possess a "theory of mind" (ToM) - the ability to attribute mental states to oneself and others - there has been little work testing whether LLMs can implicitly apply such knowledge to predict behavior, or to judge whether an observed behavior is rational. Such skills are critical for appropriate interaction in social environments. We create a new dataset, SimpleToM, containing concise, diverse stories (e.g., "The can of Pringles has moldy chips in it. Mary picks up the can in the supermarket and walks to the cashier."), each with three questions that test different degrees of ToM reasoning, asking models to predict (a) mental state ("Is Mary aware of the mold?"), (b) behavior ("Will Mary pay for the chips or report the mold?"), and (c) judgment ("Mary paid for the chips. Was that reasonable?"). To our knowledge, SimpleToM is the first dataset to systematically explore downstream reasoning requiring knowledge of mental states in realistic scenarios. Our experimental results are intriguing: while most models can reliably predict mental state on our dataset (a), they often fail to correctly predict the behavior (b), and fare even worse at judging whether given behaviors are reasonable (c), even though being correctly aware of the protagonist's mental state should make such secondary predictions obvious. We further show that we can help models do better at (b) and (c) via interventions such as reminding the model of its earlier mental state answer and mental-state-specific chain-of-thought prompting, raising the action prediction accuracies (e.g., from 49.5% to 93.5% for GPT-4o) and judgment accuracies (e.g., from 15.3% to 94.7% for GPT-4o). While this shows that models can be coaxed to perform well, doing so requires task-specific interventions, and the natural model performances remain low, a cautionary tale for LLM deployment.
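As a rough illustration of the setup described in the abstract, the sketch below stores one story with its three questions and builds a prompt that reminds the model of its earlier mental-state answer. The class `ToMItem`, the function `with_mental_state_reminder`, the prompt wording, and the answer "No" are all hypothetical choices made for this example; they are not the paper's actual data format or prompts.

```python
from dataclasses import dataclass


@dataclass
class ToMItem:
    """One story with its three question types (hypothetical layout)."""
    story: str
    mental_state_q: str  # (a) explicit ToM
    behavior_q: str      # (b) applied ToM: behavior prediction
    judgment_q: str      # (c) applied ToM: judgment of behavior


def with_mental_state_reminder(item: ToMItem, question: str, earlier_answer: str) -> str:
    """Prepend the model's earlier mental-state answer to a later question.

    The wording is an assumption for illustration only.
    """
    return (
        f"{item.story}\n"
        f"Earlier you were asked: {item.mental_state_q}\n"
        f"You answered: {earlier_answer}\n"
        f"Question: {question}"
    )


# Story and questions are the example quoted in the abstract.
item = ToMItem(
    story=(
        "The can of Pringles has moldy chips in it. "
        "Mary picks up the can in the supermarket and walks to the cashier."
    ),
    mental_state_q="Is Mary aware of the mold?",
    behavior_q="Will Mary pay for the chips or report the mold?",
    judgment_q="Mary paid for the chips. Was that reasonable?",
)

# "No" stands in for a model's earlier answer; it is assumed, not taken from the paper.
prompt = with_mental_state_reminder(item, item.behavior_q, "No")
print(prompt)
```

The same helper could wrap the judgment question instead of the behavior question; only the `question` argument changes.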

Paper Structure

This paper contains 37 sections, 13 figures, 8 tables.

Figures (13)

  • Figure 1: To allow for a nuanced analysis of models' neural ToM abilities, SimpleToM covers both explicit ToM (a) and applied ToM (b, c) question types. SimpleToM measures the ability of LLMs to (a) infer the character's mental state, specifically information awareness, (b) anticipate their likely next behavior in the given situation, and (c) make an appropriate judgment of the character's behavior that correctly accounts for their mental state.
  • Figure 2: We leverage the generative strength of language models to obtain concise stories with varied entities and diverse situations, suitable for testing different levels of ToM reasoning. The generated stories (and answer options) were then rigorously filtered by careful human annotators who passed a strict qualification test. The result is a high-quality and diverse dataset, SimpleToM.
  • Figure 3: Considering the sequence of first predicting mental state, then behavior, and finally judgment, we record a failure at the first mistake in that sequence. The "fail at behavior prediction" and "fail at judgment" parts can be considered inconsistent predictions by the model, since it got the associated mental state (and behavior prediction) questions correct.
  • Figure 4: Comparing performance for all three question types across select scenarios and models. Each bar represents the overall accuracy. The mental state accuracy is generally near 100%, while behavior prediction and judgment accuracies are often much lower.
  • Figure 5: Instructions presented to Amazon Mechanical Turk workers.
  • ...and 8 more figures
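
The first-failure accounting described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, assuming each story's results are available as three booleans; the names `STAGES`, `first_failure`, and `tally`, and the sample results, are hypothetical and not taken from the paper.

```python
from collections import Counter

# Question types in the order given in the caption, with display labels.
STAGES = (
    ("mental_state", "fail at mental state"),
    ("behavior", "fail at behavior prediction"),
    ("judgment", "fail at judgment"),
)


def first_failure(correct: dict) -> str:
    """Return the label of the first stage answered incorrectly."""
    for key, label in STAGES:
        if not correct[key]:
            return label
    return "all correct"


def tally(results: list) -> Counter:
    """Count stories by the stage of their first mistake."""
    return Counter(first_failure(r) for r in results)


# Made-up per-story correctness flags, for illustration only.
results = [
    {"mental_state": True, "behavior": True, "judgment": True},
    {"mental_state": True, "behavior": False, "judgment": False},
    {"mental_state": True, "behavior": True, "judgment": False},
    {"mental_state": False, "behavior": True, "judgment": True},
]

counts = tally(results)
for label, n in counts.items():
    print(f"{label}: {n}")
```

Under this accounting, a story counted as "fail at behavior prediction" or "fail at judgment" is one where every earlier question in the sequence was answered correctly, which is why the caption treats those cases as inconsistent predictions.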