MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents
Ruoxuan Zhang, Qiyun Zheng, Zhiyu Zhou, Ziqi Liao, Siyu Wu, Jian-Yu Jiang-Lin, Bin Wen, Hongxia Xie, Jianlong Fu, Wen-Huang Cheng
TL;DR
MindPower addresses the lack of robot-centric Theory-of-Mind (ToM) reasoning in VLM-based embodied agents by introducing the Robot-Centric MindPower Reasoning Hierarchy, which connects perception, belief-desire-intention reasoning, and downstream decision and action. It defines the MindPower Benchmark, comprising two tasks and a 590-sample dataset collected in two simulators, together with a two-stage training regime (supervised fine-tuning, SFT, and GRPO-based reinforcement learning) and Mind-Reward, a reward that enforces consistency and robot-centricity. Empirical results show gains over baselines, including improvements over GPT-4o of 12.77% in decision making and 12.49% in action generation; ablations demonstrate the value of the hierarchy and the reinforcement signals. The work enables more proactive human–robot collaborative behavior and lays groundwork for real-world deployment and future multi-agent extensions.
Abstract
Theory of Mind (ToM) refers to the ability to infer others' mental states, such as beliefs, desires, and intentions. Current vision-language embodied agents lack ToM-based decision-making, and existing benchmarks focus solely on human mental states while ignoring the agent's own perspective, hindering coherent decision and action generation. To address this, we propose MindPower, a Robot-Centric framework integrating Perception, Mental Reasoning, Decision Making and Action. Given multimodal inputs, MindPower first perceives the environment and human states, then performs ToM Reasoning to model both self and others, and finally generates decisions and actions guided by inferred mental states. Furthermore, we introduce Mind-Reward, a novel optimization objective that encourages VLMs to produce consistent ToM Reasoning and behavior. Our model outperforms GPT-4o by 12.77% in decision making and 12.49% in action generation.
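The staged flow described above (perception, then belief-desire-intention reasoning, then decision and action) can be pictured as an ordered record that is filled in one stage at a time. The following is a minimal illustrative sketch, not the authors' implementation: the class name `ReasoningTrace`, the `STAGES` tuple, and the helper functions are hypothetical, and representing each stage as a free-text string is an assumption made only for this example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stage order, taken from the summary's description of the
# hierarchy: perception -> belief/desire/intention -> decision -> action.
STAGES = ("perception", "belief", "desire", "intention", "decision", "action")


@dataclass
class ReasoningTrace:
    """Illustrative container for one pass through the staged reasoning.

    Each field holds free text for that stage (an assumption of this sketch).
    """
    perception: Optional[str] = None
    belief: Optional[str] = None
    desire: Optional[str] = None
    intention: Optional[str] = None
    decision: Optional[str] = None
    action: Optional[str] = None


def first_missing_stage(trace: ReasoningTrace) -> Optional[str]:
    """Return the earliest stage that is empty, or None if all are filled."""
    for name in STAGES:
        if not getattr(trace, name):
            return name
    return None


def is_complete(trace: ReasoningTrace) -> bool:
    """A trace is complete only when every stage, in order, has content."""
    return first_missing_stage(trace) is None


# Example: a trace that stops after belief inference is not yet actionable.
partial = ReasoningTrace(
    perception="A person is searching the kitchen counter.",
    belief="The person believes the cup is on the counter.",
)
full = ReasoningTrace(
    perception="A person is searching the kitchen counter.",
    belief="The person believes the cup is on the counter.",
    desire="The person wants the cup.",
    intention="The person intends to keep searching the counter.",
    decision="Help by retrieving the cup from the cabinet.",
    action="walk to cabinet; open cabinet; grab cup; hand cup to person",
)
```

Keeping the stages in a fixed order makes it easy to check that a decision or action is never emitted without the mental-state reasoning that is supposed to justify it, which is the kind of consistency the abstract attributes to Mind-Reward.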
