Do Theory of Mind Benchmarks Need Explicit Human-like Reasoning in Language Models?

Yi-Long Lu, Chunhui Zhang, Jiajun Song, Lifeng Fan, Wei Wang

TL;DR

The study asks whether current Theory of Mind (ToM) benchmarks truly require explicit human-like reasoning or can be solved by alternative strategies. Applying rule-based RL and SFT to LLMs spanning 0.5B to 7B parameters and evaluating on multiple ToM datasets, it finds a scale-dependent effect of RL: larger models (e.g., 7B) develop high-quality, transferable belief-tracking reasoning when trained with RL, while smaller models (≤3B) exhibit a reasoning collapse, reaching high accuracy with drastically shortened, less meaningful reasoning. Surprisingly, Supervised Fine-Tuning (SFT) alone achieves competitive and generalizable performance across benchmarks, challenging the necessity of explicit mental-state simulation for current tasks. These results imply that existing ToM benchmarks may be solvable via pattern-based strategies that exploit the structure of the benchmark data rather than genuine, step-by-step mental-state reasoning, underscoring the need for evaluation methods that probe the depth and structure of reasoning beyond final answers.

Abstract

Theory of Mind (ToM), the ability to attribute mental states to others, is fundamental for human social intelligence and a critical capability for advanced Artificial Intelligence. Recent advancements in Large Language Models (LLMs) have shown promising performance on ToM benchmarks, raising the question: Do these benchmarks necessitate explicit human-like reasoning processes, or can models succeed through alternative strategies? We investigate this question empirically by applying Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) to LLMs of varying scales (0.5B to 7B parameters) and evaluating them across multiple ToM datasets. Our results reveal a scale-dependent impact of RL: while RL significantly improves accuracy and fosters high-quality, interpretable, and transferable belief-tracking reasoning in larger models (7B), it leads to "reasoning collapse" in smaller models ($\leq$3B), where high accuracy and generalization ability are achieved via drastically shortened, less meaningful responses. Surprisingly, further SFT achieves competitive and generalizable performance across these benchmarks, often matching or exceeding RL models in accuracy, despite not being explicitly trained to produce structured reasoning traces. These findings highlight a critical discrepancy between benchmark accuracy and the nature of learned reasoning. Our work suggests that current ToM benchmarks may be solvable without requiring the explicit, human-like simulation of mental states they were designed to probe. LLMs, particularly when scale is limited or training signals focus solely on output correctness, may leverage alternative rules effective for benchmark data structures.
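The abstract's remark about training signals that focus solely on output correctness can be illustrated with a minimal sketch of a rule-based reward. The `<think>`/`<answer>` tag format, the reward values, and the function name `rule_based_reward` are assumptions made for illustration; they are not taken from the paper and should not be read as its actual reward design.

```python
import re

# Hypothetical output format (an assumption, not the paper's specification):
# reasoning inside <think>...</think>, final answer inside <answer>...</answer>.
_FORMAT = re.compile(
    r"^\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$",
    re.DOTALL,
)


def rule_based_reward(response: str, gold_answer: str) -> float:
    """Score a response using only its format and its final answer.

    Illustrative reward values:
      -1.0  the response does not follow the expected tag format
       0.0  the format is valid but the answer is wrong
       1.0  the format is valid and the answer matches the gold label
    """
    match = _FORMAT.match(response)
    if match is None:
        return -1.0
    predicted = match.group(2).strip().lower()
    return 1.0 if predicted == gold_answer.strip().lower() else 0.0


# The content of the <think> block never influences the score.
long_trace = "<think>Anne left before the ball was moved, so she still believes it is in the basket.</think><answer>basket</answer>"
empty_trace = "<think></think><answer>basket</answer>"
print(rule_based_reward(long_trace, "basket"), rule_based_reward(empty_trace, "basket"))
```

Under a reward of this shape, a response with an empty reasoning block scores the same as one with a careful belief-tracking trace, which is one way a model could keep high accuracy while its responses shorten.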

Paper Structure

This paper contains 50 sections, 2 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Model response length and training dynamics. (A) Average response length on Hi-ToM; models smaller than 3B collapsed their responses after RL training. (B) Training scores continued to rise throughout training (only the 3B, 7B, and 7B-1M models are shown for readability). (C) Response length during training: the 3B model (light blue) shows a decreasing trend, while the 7B models maintain longer responses. Shaded bands indicate 90% confidence intervals.
  • Figure 2: Analysis of model reasoning on fourth-order ToM tasks. (A) ToM accuracy versus thinking quality. The 7B-1M model achieves the highest accuracy with good reasoning quality. (B) Teaching GPT-4o-mini with model-generated reasoning. The 7B-1M model's reasoning helps GPT-4o-mini surpass DeepSeek-v3 (gray line), even when conclusions are removed from the input. Error bars indicate 90% confidence intervals.