Do Theory of Mind Benchmarks Need Explicit Human-like Reasoning in Language Models?
Yi-Long Lu, Chunhui Zhang, Jiajun Song, Lifeng Fan, Wei Wang
TL;DR
The study asks whether current Theory of Mind (ToM) benchmarks truly require explicit human-like reasoning or can be solved by alternative strategies. The authors apply rule-based Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) to LLMs spanning 0.5B to 7B parameters and evaluate them on multiple ToM datasets. The effect of RL depends on scale: larger models (e.g., 7B) develop high-quality, transferable belief-tracking reasoning, while smaller models (≤3B) exhibit a reasoning collapse, reaching high accuracy with drastically shortened, less meaningful reasoning. Surprisingly, SFT alone achieves competitive and generalizable performance across benchmarks, which challenges the necessity of explicit mental-state simulation for current tasks. These results imply that existing ToM benchmarks may be solvable via pattern-based strategies that exploit data structure rather than genuine, step-by-step mental-state reasoning, underscoring the need for evaluation methods that probe the depth and structure of reasoning beyond final answers.
Abstract
Theory of Mind (ToM), the ability to attribute mental states to others, is fundamental for human social intelligence and a critical capability for advanced Artificial Intelligence. Recent advancements in Large Language Models (LLMs) have shown promising performance on ToM benchmarks, raising the question: Do these benchmarks necessitate explicit human-like reasoning processes, or can models succeed through alternative strategies? We investigate this question empirically by applying Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) to LLMs of varying scales (0.5B to 7B parameters) and evaluating them across multiple ToM datasets. Our results reveal a scale-dependent impact of RL: while RL significantly improves accuracy and fosters high-quality, interpretable, and transferable belief-tracking reasoning in larger models (7B), it leads to "reasoning collapse" in smaller models ($\leq$3B), where high accuracy and generalization ability are achieved via drastically shortened, less meaningful responses. Surprisingly, further SFT achieves competitive and generalizable performance across these benchmarks, often matching or exceeding RL models in accuracy, despite not being explicitly trained to produce structured reasoning traces. These findings highlight a critical discrepancy between benchmark accuracy and the nature of learned reasoning. Our work suggests that current ToM benchmarks may be solvable without requiring the explicit, human-like simulation of mental states they were designed to probe. LLMs, particularly when scale is limited or training signals focus solely on output correctness, may instead exploit alternative rules that happen to be effective on the structure of the benchmark data.
