A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks

Hieu Minh "Jord" Nguyen

TL;DR

This paper surveys behavioural and representational ToM in Large Language Models (LLMs), showing that while LLMs match human performance on some tasks, their ToM remains fragile across contexts. It documents internal belief representations and their influence on ToM tasks, indicating emerging cognitive structure within LLMs. The survey highlights significant safety risks in both user-facing and multi-agent settings, including privacy leakage, deception, and misalignment, and proposes directions for evaluation and mitigation. Overall, the work emphasizes the need for robust evaluation frameworks and safety strategies to harness ToM capabilities responsibly as LLMs advance.

Abstract

Theory of Mind (ToM), the ability to attribute mental states to others and predict their behaviour, is fundamental to social intelligence. In this paper, we survey studies evaluating behavioural and representational ToM in Large Language Models (LLMs), identify important safety risks from advanced LLM ToM capabilities, and suggest several research directions for effective evaluation and mitigation of these risks.

Paper Structure

This paper contains 10 sections.