Evaluating Theory of Mind in Question Answering

Aida Nematzadeh, Kaylee Burns, Erin Grant, Alison Gopnik, Thomas L. Griffiths

TL;DR

This paper introduces the Theory of Mind Task Dataset to rigorously test question-answering models on reasoning about others' beliefs and inconsistent world states. It benchmarks four memory-augmented neural architectures (MemN2N, Multi-Observer, EntNet, RelNet) and finds that none fully master true-, false-, or second-order false-belief reasoning, especially under distractors. The results reveal that current models struggle to maintain multiple concurrent states (world and agents' beliefs), and performance degrades with noise, highlighting limitations of existing architectures for theory-of-mind reasoning. The work emphasizes the dataset as a diagnostic tool for mental-state reasoning, rather than a language-fluency benchmark, and calls for models capable of explicit representations of beliefs and state histories. The findings have implications for designing QA systems that interact with humans and reason about others' mental states in dynamic environments.

Abstract

We propose a new dataset for evaluating question answering models with respect to their capacity to reason about beliefs. Our tasks are inspired by theory-of-mind experiments that examine whether children are able to reason about the beliefs of others, in particular when those beliefs differ from reality. We evaluate a number of recent neural models with memory augmentation. We find that all fail on our tasks, which require keeping track of inconsistent states of the world; moreover, the models' accuracy decreases notably when random sentences are introduced to the tasks at test.
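
To make "inconsistent states of the world" concrete, the sketch below tracks two things at once in a Sally-Anne style story: where an object really is, and where each agent believes it is. It is an illustration only, not the paper's data generator or any of the evaluated models. The class `BeliefTracker`, its methods, and the story details (marble, basket, box) are hypothetical, and it assumes that agents update a belief only when they witness a move and that entering a room reveals nothing about hidden objects.

```python
from dataclasses import dataclass, field


@dataclass
class BeliefTracker:
    """Hypothetical helper: true object locations plus per-agent beliefs."""
    world: dict = field(default_factory=dict)    # object -> container (reality)
    present: set = field(default_factory=set)    # agents currently in the room
    beliefs: dict = field(default_factory=dict)  # agent -> {object: container}

    def enter(self, agent):
        # Assumption: objects are hidden inside containers, so entering
        # the room does not change what the agent believes.
        self.present.add(agent)
        self.beliefs.setdefault(agent, {})

    def leave(self, agent):
        self.present.discard(agent)

    def move(self, obj, container):
        # Reality always changes.
        self.world[obj] = container
        # Only agents who witness the move update their belief.
        for agent in self.present:
            self.beliefs[agent][obj] = container


tracker = BeliefTracker()
tracker.enter("Sally")
tracker.enter("Anne")
tracker.move("marble", "basket")  # both agents see the marble go in the basket
tracker.leave("Sally")
tracker.move("marble", "box")     # only Anne sees the marble moved to the box

reality = tracker.world["marble"]
sally_belief = tracker.beliefs["Sally"]["marble"]
anne_belief = tracker.beliefs["Anne"]["marble"]
print(reality, sally_belief, anne_belief)
```

After the second move, the stored reality and Sally's stored belief disagree, which is the kind of inconsistency a false-belief question probes. A model that keeps only a single world state has no way to answer a reality question and a belief question differently.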

Paper Structure

This paper contains 23 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: The Sally-Anne experiment setup from Baron-Cohen et al. (1985).
  • Figure 2: An example story from each of the three task types.
  • Figure 3: Memory Network and Multiple Observer Model Performance Across Task and Question Types. Pink indicates that the answer to the question is the first container that contained the object in that task. Blue indicates that the answer is the last container that contained the object before the question was asked. Grey indicates that the answer is the first container that contained the object in the entire story, which may or may not be the same container as the pink answer.