MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents

Ruoxuan Zhang, Qiyun Zheng, Zhiyu Zhou, Ziqi Liao, Siyu Wu, Jian-Yu Jiang-Lin, Bin Wen, Hongxia Xie, Jianlong Fu, Wen-Huang Cheng

TL;DR

MindPower addresses the lack of robot-centric Theory of Mind in vision-language-based embodied agents by introducing a Robot-Centric MindPower Reasoning Hierarchy that connects perception, belief-desire-intention reasoning, and downstream decision and action. It defines the MindPower Benchmark with two tasks and a 590-sample dataset collected in two simulators, accompanied by a two-stage training regime (SFT and GRPO) and Mind-Reward to enforce consistency and robot-centricity. Empirical results show gains over baselines, including improvements over GPT-4o of 12.77% in decision making and 12.49% in action generation, and ablations demonstrate the value of the hierarchy and the reinforcement signals. The work enables more proactive human–robot collaborative behavior and lays groundwork for real-world deployment and future multi-agent extensions.

Abstract

Theory of Mind (ToM) refers to the ability to infer others' mental states, such as beliefs, desires, and intentions. Current vision-language embodied agents lack ToM-based decision-making, and existing benchmarks focus solely on human mental states while ignoring the agent's own perspective, hindering coherent decision and action generation. To address this, we propose MindPower, a Robot-Centric framework integrating Perception, Mental Reasoning, Decision Making and Action. Given multimodal inputs, MindPower first perceives the environment and human states, then performs ToM Reasoning to model both self and others, and finally generates decisions and actions guided by inferred mental states. Furthermore, we introduce Mind-Reward, a novel optimization objective that encourages VLMs to produce consistent ToM Reasoning and behavior. Our model outperforms GPT-4o by 12.77% in decision making and 12.49% in action generation.
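The pipeline described above (perception, then belief-desire-intention reasoning about both the human and the robot, then decision and action) can be pictured as a single structured record. The sketch below is illustrative only and is not the authors' implementation: the class names `MentalState` and `ReasoningTrace`, the helper `missing_layers`, the field layout, and the example scenario text are all hypothetical, and the split into perception, belief, desire, intention, decision, and action is an assumption based on the description in this summary.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MentalState:
    """Belief, desire, and intention attributed to one agent (hypothetical layout)."""
    belief: str = ""
    desire: str = ""
    intention: str = ""


@dataclass
class ReasoningTrace:
    """One robot-centric reasoning trace: perception, mental reasoning, decision, action."""
    perception: str = ""
    human: MentalState = field(default_factory=MentalState)
    robot: MentalState = field(default_factory=MentalState)
    decision: str = ""
    actions: List[str] = field(default_factory=list)


def missing_layers(trace: ReasoningTrace) -> List[str]:
    """Return the names of layers that are empty, in pipeline order."""
    missing = []
    if not trace.perception.strip():
        missing.append("perception")
    for who, state in (("human", trace.human), ("robot", trace.robot)):
        for name in ("belief", "desire", "intention"):
            if not getattr(state, name).strip():
                missing.append(f"{who}.{name}")
    if not trace.decision.strip():
        missing.append("decision")
    if not trace.actions:
        missing.append("actions")
    return missing


# Hypothetical false-belief scenario, written only to exercise the structure.
example = ReasoningTrace(
    perception="Alice left an item on the table; it was moved while she was away.",
    human=MentalState(
        belief="The item is still on the table.",
        desire="Retrieve the item.",
        intention="Search the table.",
    ),
    robot=MentalState(
        belief="The item is no longer on the table.",
        desire="Help Alice find the item.",
        intention="Correct Alice's false belief.",
    ),
    decision="Tell Alice where the item is now.",
    actions=["walk_to(Alice)", "say(item_location)"],
)
```

Under these assumptions, `missing_layers(example)` returns an empty list, while an empty `ReasoningTrace()` reports every layer as missing. A check of this kind is one simple way to verify that a generated output covers every stage before its content is scored.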

Paper Structure

This paper contains 34 sections, 6 equations, 21 figures, 3 tables.

Figures (21)

  • Figure 1: MindPower Benchmark Overview. We evaluate Robot-Centric ToM through two tasks: False-Belief Correction and Implicit Goal Inference & Completion, assessing whether VLM-based embodied agents can generate correct decisions and actions. We further propose the MindPower Reasoning Hierarchy, comprising three levels and six layers. Existing VLMs perform poorly across layers, especially in action reasoning, while our model shows substantial improvements. A detailed example is provided in Supp. Sec. B.
  • Figure 2: MindPower Reasoning Hierarchy. The agent first receives multimodal input, then performs mental reasoning to form beliefs, desires, and intentions, and finally makes decisions and generates an action plan based on this reasoning.
  • Figure 3: Robot-Centric MindPower Reasoning Hierarchy. Existing benchmarks, such as MuMA-ToM, include only Stage 1 and Stage 2 of the video and focus solely on inferring the mental states of the human (Alice) in the input video. Our dataset additionally includes Stage 3, where Alice returns to search for the item. Moreover, in Level-2 (Mental Reasoning) of MindPower, we infer the mental states of both the embodied agent and the human, whereas existing ToM benchmarks only infer the role’s mental state through multiple-choice questions. A detailed example is provided in Sec. B of the Supplementary Material.
  • Figure 4: Experiments on MindPower Benchmark.
  • Figure 5: Reward Formulation. The overall reward integrates both the Mind-Reward and the Format-Reward components.
  • ...and 16 more figures