Robi Butler: Multimodal Remote Interaction with a Household Robot Assistant

Anxing Xiao, Nuwan Janaka, Tianrun Hu, Anshul Gupta, Kaixin Li, Cunjun Yu, David Hsu

TL;DR

Robi Butler addresses remote, multimodal control of household robots. It combines Zoom-based communication, gesture grounding, and mixed-reality (MR) viewing with an LLM-driven high-level planner and open-vocabulary primitive skills. The system translates language and gestures into executable action sequences $P = \{a_0, a_1, \ldots, a_N\}$ through rule-based alignment (e.g., a demonstrative is replaced by a placeholder $\ast$ that is resolved from the gesture history), and grounds those actions with vision-language models (OWLv2, Segment Anything) and motion planning (MoveIt). Its contributions are a three-component architecture; end-to-end grounding for manipulation, navigation, and VQA; and empirical evidence of high task success (mean Task SR $= 96.7\%$) together with a favorable user experience, with multimodal input outperforming unimodal baselines in usability and trust. The work advances practical remote household robotics by demonstrating open-vocabulary grounding and real-time human-robot collaboration, and it offers design insights for future multimodal HRI systems built from independent, scalable components.
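The placeholder mechanism mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Action`, `Gesture`, and `resolve_placeholders`, the `point@x,y` target encoding, and the rule that each placeholder consumes the earliest unused gesture are all assumptions made for illustration. The summary only states that a placeholder is resolved from the gesture history.

```python
from dataclasses import dataclass
from typing import List, Tuple

PLACEHOLDER = "*"  # stands in for a demonstrative such as "this" or "that"


@dataclass(frozen=True)
class Action:
    skill: str   # hypothetical primitive name, e.g. "pick"
    target: str  # object phrase, or PLACEHOLDER


@dataclass(frozen=True)
class Gesture:
    timestamp: float        # seconds on a hypothetical shared clock
    pixel: Tuple[int, int]  # image coordinate the user pointed at


def resolve_placeholders(plan: List[Action], gestures: List[Gesture]) -> List[Action]:
    """Replace each placeholder target with the earliest unused gesture.

    Assumed rule: placeholders and gestures are matched one-to-one in
    temporal order. The real alignment rule may differ.
    """
    pending = sorted(gestures, key=lambda g: g.timestamp)
    resolved = []
    for action in plan:
        if action.target == PLACEHOLDER:
            if not pending:
                raise ValueError(f"no gesture available for '{action.skill}'")
            gesture = pending.pop(0)
            x, y = gesture.pixel
            resolved.append(Action(action.skill, f"point@{x},{y}"))
        else:
            resolved.append(action)
    return resolved


# "Pick up this and put it on the table", with one pointing gesture.
plan = [Action("pick", PLACEHOLDER), Action("place", "table")]
history = [Gesture(timestamp=1.0, pixel=(320, 240))]
resolved_plan = resolve_placeholders(plan, history)
```

In this sketch a pointed-at pixel simply becomes a string target. A grounding step would still have to turn that pixel into an object, for example by prompting a segmentation model.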

Abstract

Imagine a future when we can Zoom-call a robot to manage household chores remotely. This work takes one step in this direction. Robi Butler is a new household robot assistant that enables seamless multimodal remote interaction. It allows the human user to monitor its environment from a first-person view, issue voice or text commands, and specify target objects through hand-pointing gestures. At its core, a high-level behavior module, powered by Large Language Models (LLMs), interprets multimodal instructions to generate multistep action plans. Each plan consists of open-vocabulary primitives supported by vision-language models, enabling the robot to process both textual and gestural inputs. Zoom provides a convenient interface to implement remote interactions between the human and the robot. The integration of these components allows Robi Butler to ground remote multimodal instructions in real-world home environments in a zero-shot manner. We evaluated the system on various household tasks, demonstrating its ability to execute complex user commands with multimodal inputs. We also conducted a user study to examine how multimodal interaction influences user experiences in remote human-robot interaction. These results suggest that with the advances in robot foundation models, we are moving closer to the reality of remote household robot assistants.

Paper Structure (22 sections, 11 figures, 1 table)

This paper contains 22 sections, 11 figures, 1 table.

Figures (11)

  • Figure 1: The Robi Butler system enables the user to Zoom-call the butler robot at home remotely and interact with it naturally through both language and hand gestures.
  • Figure 2: An overview of Robi Butler. The system consists of three components: Communication Interfaces, High-level Behavior Module, and Primitive Skills. The Communication Interfaces transmit the inputs received from the remote user to the High-level Behavior Module, which composes the Primitive Skills to interact with the environment and fulfill the instructions or answer questions.
  • Figure 3: The framework of communication interfaces.
  • Figure 4: The framework of high-level behavior module.
  • Figure 5: Human-Robot Remote interactions via language and gestures.
  • ...and 6 more figures