Understanding Social Reasoning in Language Models with Language Models

Kanishk Gandhi; Jan-Philipp Fränken; Tobias Gerstenberg; Noah D. Goodman

TL;DR

The paper introduces BigToM, a scalable benchmark for Theory-of-Mind in LLMs generated via a three-stage causal-template framework: (1) design a ToM causal template, (2) populate it with LLMs, and (3) compose test items into 5,000 scenarios. It demonstrates that model-generated items are high-quality, outperforming crowd-sourced data and rivaling expert items in human judgments. Through extensive experiments across multiple models and prompting strategies, GPT-4 shows ToM-like inference patterns close to human behavior, though with reliability gaps, while other models struggle, especially on backward-belief tasks. The work offers a principled, controllable method for evaluating social reasoning in AI and discusses limitations and avenues for more realistic, dynamic benchmarks.

Abstract

As Large Language Models (LLMs) become increasingly integrated into our everyday lives, understanding their ability to comprehend human mental states becomes critical for ensuring effective interactions. However, despite recent attempts to assess the Theory-of-Mind (ToM) reasoning capabilities of LLMs, the degree to which these models can align with human ToM remains a nuanced topic of exploration. This is primarily due to two distinct challenges: (1) inconsistent results from previous evaluations, and (2) concerns about the validity of existing evaluation methodologies. To address these challenges, we present a novel framework for procedurally generating evaluations with LLMs by populating causal templates. Using our framework, we create a new social reasoning benchmark (BigToM) for LLMs which consists of 25 controls and 5,000 model-written evaluations. We find that human participants rate the quality of our benchmark higher than previous crowd-sourced evaluations and comparable to expert-written evaluations. Using BigToM, we evaluate the social reasoning capabilities of a variety of LLMs and compare model performance with human performance. Our results suggest that GPT-4 has ToM capabilities that mirror human inference patterns, though less reliably, while other LLMs struggle.
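To make the idea of composing evaluations from a populated causal template more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the `Template` fields, the `compose_items` function, and the example scenario about Mara are all hypothetical names and content invented here. The sketch only mirrors the structure described above, in which the agent's percept is varied to yield true-belief and false-belief versions, and the backward-belief condition shows the action while withholding the percept.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Template:
    """Hypothetical container for the variables of one populated template."""
    context: str           # agent, desire, and initial belief
    causal_event: str      # event that changes the state of the environment
    percept_aware: str     # agent perceives the event (leads to a true belief)
    percept_unaware: str   # agent misses the event (leads to a false belief)
    action_aware: str      # action consistent with the updated belief
    action_unaware: str    # action consistent with the outdated belief
    belief_question: str
    action_question: str


def compose_items(t: Template) -> list[dict]:
    """Combine template variables into one item per (task, condition) pair."""
    tasks = ("forward_belief", "forward_action", "backward_belief")
    conditions = ("true_belief", "false_belief")
    items = []
    for task, condition in product(tasks, conditions):
        aware = condition == "true_belief"
        percept = t.percept_aware if aware else t.percept_unaware
        action = t.action_aware if aware else t.action_unaware
        if task == "backward_belief":
            # The percept is withheld; the belief must be inferred from the action.
            story, question = [t.context, t.causal_event, action], t.belief_question
        elif task == "forward_action":
            story, question = [t.context, t.causal_event, percept], t.action_question
        else:  # forward_belief
            story, question = [t.context, t.causal_event, percept], t.belief_question
        items.append({"task": task, "condition": condition,
                      "story": " ".join(story), "question": question})
    return items


# Invented example scenario, used only to exercise the sketch.
example = Template(
    context="Mara wants to water her plants and believes the watering can is full.",
    causal_event="While she is away, the can tips over and empties.",
    percept_aware="Mara sees the can tip over.",
    percept_unaware="Mara does not see the can tip over.",
    action_aware="Mara refills the can at the tap.",
    action_unaware="Mara carries the can straight to her plants.",
    belief_question="Does Mara believe the watering can is full or empty?",
    action_question="What will Mara do next?",
)
items = compose_items(example)
```

Because every item is assembled from the same small set of variables, the true-belief and false-belief versions of a story differ only in the manipulated sentence, which is what makes this kind of comparison controlled.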

Paper Structure (22 sections, 15 figures, 12 tables)

Introduction
Related Work
Model-Written Evaluations with Causal Templates
Stage 1: Building a Causal Template
Stage 2: Populating Causal Templates With Language Models
Stage 3: Composing Test Items from Template Variables
Quality of Generated Data
Experiments
Results and Discussion
Discussion
Generating Templates
Human Experiments
Experiment 1: Human Quality Ratings
Experiment 2: Human Performance
Failure cases
...and 7 more sections

Figures (15)

Figure 1: Illustration of our template-based Theory-of-Mind (ToM) scenarios. [a] The causal template and an example scenario including prior desires, actions, and beliefs, and a causal event that changes the state of the environment. [b] Testing Forward Belief inference by manipulating an agent's percepts. TB = True Belief. FB = False Belief. [c] Forward Action inference from an agent's percepts which requires additional inferences over unknown beliefs. [d] Backward Belief inference requires joint inferences over unknown percepts and beliefs from an agent's observed actions. Error bars for human performance represent 95% bootstrapped confidence intervals of the mean.
Figure 2: [a] Three-stage method for generating evaluations: Building a causal template for the domain (left). Creating a prompt template (simplified here; see \ref{afig:prompt_for_gen} for the prompt) from the causal graph and populating template variables using a language model (middle). Composing test items by combining template variables (right). [b] Crowdworker ratings of our model-generated Theory-of-Mind (ToM) evaluations compared to crowd-sourced ToM evaluations and expert-written ToM evaluations. Error bars represent 95% bootstrapped confidence intervals of the mean.
Figure 3: Model performance (0-shot) across conditions. [a] Forward Belief inferences from percepts to beliefs. TB = True Belief. FB = False Belief. [b] Forward Action inferences from an agent's percepts which require additional inferences over unknown beliefs. [c] Backward Belief inferences over unknown percepts and beliefs from an agent's observed actions. Error bars for humans represent 95% bootstrapped confidence intervals of the mean.
Figure 4: Prompt template for generating model completions.
Figure 5: Example model completion.
...and 10 more figures
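
Several captions above report 95% bootstrapped confidence intervals of the mean. The captions do not say which bootstrap variant was used, so the following Python sketch assumes the common percentile bootstrap. The function name `bootstrap_ci_mean`, its parameters, and the example scores are hypothetical and are not taken from the paper.

```python
import random


def bootstrap_ci_mean(values, n_resamples=10_000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    alpha = (1 - level) / 2
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha) * n_resamples))]
    return lower, upper


# Invented data: 1 = correct response, 0 = incorrect response.
scores = [1] * 30 + [0] * 10
low, high = bootstrap_ci_mean(scores, n_resamples=2000)
```

Here `low` and `high` bracket the sample accuracy of 0.75 and would serve as the ends of an error bar.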
