Probing the Robustness of Theory of Mind in Large Language Models

Christian Nickel; Laura Schrewe; Lucie Flek

TL;DR

This work introduces a novel dataset of 68 tasks for probing ToM in LLMs, including potentially challenging variations that are assigned to 10 complexity classes, and provides novel insights into the challenges LLMs face with those task variations.

Abstract

With the success of ChatGPT and other similarly sized state-of-the-art (SotA) LLMs, claims of emergent human-like social reasoning capabilities, especially Theory of Mind (ToM), in these models have appeared in the scientific literature. On the one hand, those ToM capabilities have been successfully tested using tasks similar in style to those used in psychology (Kosinski, 2023). On the other hand, follow-up studies showed that those capabilities vanished when the tasks were slightly altered (Ullman, 2023). In this work, we introduce a novel dataset of 68 tasks for probing ToM in LLMs, including potentially challenging variations that are assigned to 10 complexity classes, which provides novel insights into the challenges LLMs face with those task variations. We evaluate the ToM performance of four SotA open-source LLMs on our dataset and on the dataset introduced by Kosinski (2023). The overall low goal accuracy across all evaluated models indicates only a limited degree of ToM capabilities. The LLMs' performance on simple complexity class tasks from both datasets is similar. However, we find a consistent tendency in all tested LLMs to perform poorly on tasks that require the realization that an agent has knowledge of automatic state changes in its environment, even when those are spelled out to the model. For task complications that change the relationship between objects by replacing prepositions, we notice a performance drop in all models, with the strongest impact on the mixture-of-experts model. With our dataset of tasks grouped by complexity, we offer directions for further research on how to stabilize and advance ToM capabilities in LLMs.

Paper Structure (30 sections, 3 figures, 2 tables)

Introduction
Related Work
Methodology
Overview
Dataset Creation and Outline
Complexity Classes
automatic change knowledge
add unrelated information
induction from baseline
untrustworthy testimony
conclusion from sentiment
Models and Inference
Evaluation Metrics
Results
Llama-2-70-b-chat-hf
...and 15 more sections

Figures (3)

Figure 1: Dataset creation and evaluation pipeline
Figure 2: Overview of turn accuracy of Llama2 (blue), Vicuna (red), Mixtral (green) and Yi (purple) for each complexity class and overall
Figure 3: Overview of goal accuracy of Llama2 (blue), Vicuna (red), Mixtral (green) and Yi (purple) for each complexity class and overall
