AToM-Bot: Embodied Fulfillment of Unspoken Human Needs with Affective Theory of Mind

Wei Ding, Fanhong Li, Ziteng Ji, Zhengrong Xue, Jia Liu

TL;DR

AToM-Bot is a proactive robot framework that infers unspoken human needs from multimodal cues via an Affective Theory of Mind and uses a Vision-Language Model to generate and execute feasible tasks under embodiment constraints. The approach grounds perception, reasoning, and action with open-vocabulary manipulation, Grounding SAM, and DINOBot-style alignment, and is evaluated across 16 scenarios and 118 participants. Human evaluators rated need detection (6.42/7), embodied solution (6.15/7), and task execution (6.17/7) highly, indicating strong alignment with human expectations. The work highlights the potential of affective-ToM-guided proactive HRI for seamless daily-life assistance and outlines avenues for multimodal sensing and personalization.

Abstract

We propose AToM-Bot, a novel task generation and execution framework for proactive robot-human interaction, which leverages the human mental and physical state inference capabilities of the Vision Language Model (VLM) prompted by the Affective Theory of Mind (AToM). Without requiring explicit commands from humans, AToM-Bot proactively generates and follows feasible tasks to improve general human well-being. When around humans, AToM-Bot first detects current human needs based on inferred human states and observations of the surrounding environment. It then generates tasks to fulfill these needs, taking into account its embodied constraints. We designed 16 daily life scenarios spanning 4 common scenes and presented the same visual stimuli to 59 human subjects and our robot. We used the similarity between human open-ended answers and robot output, together with human satisfaction scores, to measure robot performance. AToM-Bot received high human evaluations in need detection (6.42/7, 91.7%), embodied solution (6.15/7, 87.8%) and task execution (6.17/7, 88.1%). We show that AToM-Bot excels in generating and executing feasible plans to fulfill unspoken human needs. Videos and code are available at https://affective-tom-bot.github.io.

Paper Structure (45 sections, 6 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: AToM-Bot is a novel task generation and execution framework for proactive robot-human interaction, towards the embodied fulfillment of unspoken human needs.
  • Figure 2: Overview of AToM-Bot, a robotic system for identifying and responding to human needs. It integrates human observations and environmental attributes to infer human needs. It then generates tasks for a robot, which navigates to objects, manipulates them, and assists humans in daily settings.
  • Figure 3: Examples of needs and solutions generated by AToM-Bot for selected task scenarios. The bar graphs display the proportion of responses indicating perceived needs (brown) and potential solutions (green) for specific daily activities, including eating spicy food (1), feeling tired while working (7), doing balance yoga exercises (8), and cooking (3). Each task scenario shows the responses from 118 participants, where each participant could provide one or more responses. The displayed proportions are calculated as the percentage of each response type relative to the total number of participants (118). The remaining 12 tasks are shown at the bottom.
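
The proportion calculation described in the Figure 3 caption can be sketched as follows. This is a minimal illustration and not the authors' analysis code: the function name `response_proportions`, the response labels, and the four-participant data are all hypothetical, and counting each response type at most once per participant is an assumption. Because each participant may give several responses, the percentages can sum to more than 100%.

```python
from collections import Counter


def response_proportions(responses_per_participant, n_participants):
    """Return the percentage of participants who gave each response type."""
    counts = Counter()
    for responses in responses_per_participant:
        # Assumption: a response type counts at most once per participant.
        counts.update(set(responses))
    return {label: 100.0 * c / n_participants for label, c in counts.items()}


# Hypothetical open-ended answers from four participants for one scenario.
answers = [
    ["water"],
    ["water", "tissue"],
    ["milk"],
    ["water", "milk"],
]
proportions = response_proportions(answers, n_participants=len(answers))
```

With the real data the denominator would be the total number of participants stated in the caption rather than `len(answers)`.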