Evaluating Large Language Models in Theory of Mind Tasks

Michal Kosinski

TL;DR

The results show that recent large language models can solve false-belief tasks, which are typically used to evaluate ToM in humans. This signals the advent of more powerful and socially skilled AI, with profound positive and negative implications.

Abstract

Eleven Large Language Models (LLMs) were assessed using a custom-made battery of false-belief tasks, considered a gold standard in testing Theory of Mind (ToM) in humans. The battery included 640 prompts spread across 40 diverse tasks, each one including a false-belief scenario, three closely matched true-belief control scenarios, and the reversed versions of all four. To solve a single task, a model needed to correctly answer 16 prompts across all eight scenarios. Smaller and older models solved no tasks; GPT-3-davinci-003 (from November 2022) and ChatGPT-3.5-turbo (from March 2023) solved 20% of the tasks; ChatGPT-4 (from June 2023) solved 75% of the tasks, matching the performance of six-year-old children observed in past studies. We explore the potential interpretation of these findings, including the intriguing possibility that ToM, previously considered exclusive to humans, may have spontaneously emerged as a byproduct of LLMs' improving language skills.
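The all-or-nothing scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the author's code: the function and constant names are hypothetical, and the figure of two prompts per scenario is inferred from the abstract's numbers (16 prompts across 8 scenarios, 640 prompts across 40 tasks).

```python
from typing import List

N_TASKS = 40               # tasks in the battery (from the abstract)
SCENARIOS_PER_TASK = 8     # 1 false-belief + 3 true-belief controls, plus reversed versions of all four
PROMPTS_PER_SCENARIO = 2   # inferred: 16 prompts / 8 scenarios

# One task's results: results[s][p] is True if prompt p of scenario s was answered correctly.
TaskResults = List[List[bool]]


def task_solved(results: TaskResults) -> bool:
    """A task counts as solved only if all 16 prompts across all 8 scenarios are correct."""
    if len(results) != SCENARIOS_PER_TASK:
        raise ValueError("expected 8 scenarios per task")
    if any(len(scenario) != PROMPTS_PER_SCENARIO for scenario in results):
        raise ValueError("expected 2 prompts per scenario")
    return all(all(scenario) for scenario in results)


def percent_solved(battery: List[TaskResults]) -> float:
    """Percentage of tasks in the battery that were fully solved."""
    return 100.0 * sum(task_solved(task) for task in battery) / len(battery)
```

Under this rule, a single wrong answer on any control or reversed scenario fails the whole task, which is why a model can answer most individual prompts correctly and still score far below 100%.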

Paper Structure

This paper contains 7 sections and 3 figures.

Figures (3)

  • Figure 1: Changes in the probabilities of ChatGPT-4's completions of Prompts 1.1 and 1.2 as the story was revealed in one-sentence increments.
  • Figure 2: Changes in the probabilities of ChatGPT-4's completions of Prompts 2.1 and 2.2 as the story was revealed to it in one-sentence increments. The last sentence of the story ("John comes back home and wants to play with the cat.") was added to Prompt 2.2, as this prompt made little sense on its own throughout most of the story.
  • Figure 3: The percentage of false-belief tasks solved by LLMs (out of 40). Each task contained a false-belief scenario, three accompanying true-belief scenarios, and the reversed versions of all four scenarios. A model had to solve 16 prompts across all eight scenarios to score a single point. The number of parameters and models' publication dates are in parentheses. The number of parameters for models in the GPT-3 family was estimated by Gao (55) and for ChatGPT-4 by Patel and Wong (56). Average children's performance on false-belief tasks was reported after a meta-analysis of 178 studies (54). Error bars represent 95% CI.
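The one-sentence-increment procedure behind Figures 1 and 2 can be sketched as a prompt-building step. This is a hedged illustration only: `build_incremental_prompts` and its parameters are hypothetical names, the separator between story and prompt is an assumption, and the step that queries a model for completion probabilities is omitted because the source does not specify it here.

```python
from typing import List, Optional


def build_incremental_prompts(
    story_sentences: List[str],
    prompt: str,
    extra_sentence: Optional[str] = None,
) -> List[str]:
    """Return one input per story prefix: sentences 1..i followed by the prompt.

    If extra_sentence is given, it is inserted between the prefix and the
    prompt, unless the prefix already ends with it. This mirrors the Figure 2
    note that the story's last sentence was added to Prompt 2.2.
    """
    inputs = []
    for i in range(1, len(story_sentences) + 1):
        parts = list(story_sentences[:i])
        if extra_sentence is not None and parts[-1] != extra_sentence:
            parts.append(extra_sentence)
        parts.append(prompt)
        inputs.append(" ".join(parts))
    return inputs
```

Each returned string would then be passed to the model, and the probability of each candidate completion recorded, giving one point per sentence on the x-axis of the figures.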