LLMs achieve adult human performance on higher-order theory of mind tasks

Winnie Street; John Oliver Siy; Geoff Keeling; Adrien Baranes; Benjamin Barnett; Michael McKibben; Tatenda Kanyere; Alison Lentz; Blaise Aguera y Arcas; Robin I. M. Dunbar

TL;DR

This work probes whether LLMs can perform higher-order theory of mind (ToM) tasks up to order 6 by introducing MoToMQA and benchmarking five LLMs against an adult human dataset. Using logprob-based scoring of true/false responses across seven stories and 140 statements, the study finds GPT-4 and Flan-PaLM achieving adult-level or near-adult-level ToM performance, with GPT-4 exceeding human accuracy on the 6th order. The results implicate model size and instruction finetuning as key drivers of ToM capabilities, and they reveal robust ToM performance in the strongest models even under prompt perturbations. These findings carry important implications for user-facing LLM applications, highlighting both potential benefits for cooperative tasks and risks related to advanced manipulation or misinterpretation of others' mental states.
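As a rough illustration of what logprob-based scoring of a true/false response can look like, the sketch below sums the probability mass a model assigns to several spellings of each answer and picks the larger side. It is a minimal sketch under stated assumptions, not the study's implementation: the candidate token lists, the function names `score_statement` and `accuracy`, and the input format (a dict mapping token strings to log probabilities) are all hypothetical.

```python
import math

# Hypothetical candidate spellings; the source does not say which tokens were scored.
TRUE_TOKENS = ("True", "true", "TRUE")
FALSE_TOKENS = ("False", "false", "FALSE")


def score_statement(logprobs):
    """Classify one statement from token log probabilities.

    `logprobs` maps candidate token strings to log probabilities (assumed format).
    Returns (label, prob_true), where prob_true is the probability of "true"
    renormalised over the two candidate sets.
    """
    p_true = sum(math.exp(logprobs[t]) for t in TRUE_TOKENS if t in logprobs)
    p_false = sum(math.exp(logprobs[t]) for t in FALSE_TOKENS if t in logprobs)
    total = p_true + p_false
    if total == 0:
        raise ValueError("no candidate token found in logprobs")
    prob_true = p_true / total
    label = "true" if prob_true >= 0.5 else "false"
    return label, prob_true


def accuracy(predictions, gold):
    """Fraction of predicted labels that match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


if __name__ == "__main__":
    # Made-up log probabilities for a single statement.
    example = {"True": math.log(0.6), "true": math.log(0.1), "False": math.log(0.2)}
    label, prob_true = score_statement(example)
    print(label, round(prob_true, 3))
```

Summing over spellings before comparing is one way to avoid penalising a model for capitalisation; whether and how the study did this is not stated in the text above.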

Abstract

This paper examines the extent to which large language models (LLMs) have developed higher-order theory of mind (ToM): the human ability to reason about multiple mental and emotional states in a recursive manner (e.g. I think that you believe that she knows). It builds on prior work by introducing a handwritten test suite, Multi-Order Theory of Mind Q&A, and using it to compare the performance of five LLMs to a newly gathered adult human benchmark. We find that GPT-4 and Flan-PaLM reach adult-level and near adult-level performance on ToM tasks overall, and that GPT-4 exceeds adult performance on 6th order inferences. Our results suggest that there is an interplay between model size and finetuning for the realisation of ToM abilities, and that the best-performing LLMs have developed a generalised capacity for ToM. Given the role that higher-order ToM plays in a wide range of cooperative and competitive human behaviours, these findings have significant implications for user-facing LLM applications.

Paper Structure (26 sections, 1 equation, 1 figure, 7 tables)

Introduction
Related work
Higher-order ToM
LLM ToM
Materials and method
Procedures
Human procedure
LLM procedure
Dataset creation
Results
ToM task performance
Factual task performance
Comparing performance on ToM and factual tasks
Anchoring effect
Discussion
...and 11 more sections

Figures (1)

Figure 1: Human, LaMDA, PaLM, Flan-PaLM, GPT-3.5 and GPT-4 performance on ToM tasks up to order 6