Robi Butler: Multimodal Remote Interaction with a Household Robot Assistant

Anxing Xiao; Nuwan Janaka; Tianrun Hu; Anshul Gupta; Kaixin Li; Cunjun Yu; David Hsu

TL;DR

Robi Butler tackles remote, multimodal control of household robots by combining Zoom-based communication, gesture grounding, and MR viewing with an LLM-driven high-level planner and open-vocabulary primitive skills. The system translates language and gestures into executable action sequences $P = \{a_0, a_1, \dots, a_N\}$ via a rule-based alignment (e.g., a demonstrative is replaced by a placeholder $\ast$ that is resolved from the gesture history), then grounds and executes each action with vision-language models (OWLv2, Segment Anything) and motion planning (MoveIt). The paper contributes a three-component architecture, end-to-end grounding for manipulation, navigation, and VQA, and empirical evidence of high task success (mean Task SR $= 96.7\%$) and a favorable user experience with multimodal input, which outperforms unimodal baselines in usability and trust. The work advances practical remote household robotics by demonstrating open-vocabulary grounding and real-time human-robot collaboration, and it offers design insights for future multimodal HRI systems built from independent, scalable components.
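The placeholder mechanism mentioned in the TL;DR can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `Action` and `GestureEvent` types, the `resolve_placeholders` function, and the rule "use the most recent gesture at or before the spoken phrase" are hypothetical, since this summary does not spell out the paper's actual alignment rules.

```python
from dataclasses import dataclass
from typing import List, Optional

PLACEHOLDER = "*"  # stands in for a demonstrative such as "this" or "that"


@dataclass(frozen=True)
class GestureEvent:
    timestamp: float  # seconds since the call started (hypothetical field)
    target: str       # label of the pointed-at object or region (hypothetical)


@dataclass(frozen=True)
class Action:
    skill: str        # name of a primitive skill, e.g. "pick" (hypothetical)
    argument: str     # object description, or PLACEHOLDER for a demonstrative
    timestamp: float  # when the corresponding phrase was uttered


def latest_gesture_before(history: List[GestureEvent], t: float) -> Optional[GestureEvent]:
    """Return the most recent gesture at or before time t, if any."""
    candidates = [g for g in history if g.timestamp <= t]
    return max(candidates, key=lambda g: g.timestamp) if candidates else None


def resolve_placeholders(plan: List[Action], history: List[GestureEvent]) -> List[Action]:
    """Replace each placeholder argument with a target from the gesture history."""
    resolved = []
    for action in plan:
        if action.argument == PLACEHOLDER:
            gesture = latest_gesture_before(history, action.timestamp)
            if gesture is None:
                raise ValueError(f"no gesture available for '{action.skill}'")
            action = Action(action.skill, gesture.target, action.timestamp)
        resolved.append(action)
    return resolved


history = [GestureEvent(1.0, "red mug"), GestureEvent(5.0, "sink")]
plan = [Action("pick", PLACEHOLDER, 2.0), Action("place", PLACEHOLDER, 6.0)]
for step in resolve_placeholders(plan, history):
    print(step.skill, step.argument)
```

In this toy rule each demonstrative binds to the latest preceding gesture, so "pick this" and "place it there" pick up different targets as the user keeps pointing. A real system would resolve the gesture to image regions rather than text labels.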

Abstract

Imagine a future when we can Zoom-call a robot to manage household chores remotely. This work takes one step in this direction. Robi Butler is a new household robot assistant that enables seamless multimodal remote interaction. It allows the human user to monitor its environment from a first-person view, issue voice or text commands, and specify target objects through hand-pointing gestures. At its core, a high-level behavior module, powered by Large Language Models (LLMs), interprets multimodal instructions to generate multistep action plans. Each plan consists of open-vocabulary primitives supported by vision-language models, enabling the robot to process both textual and gestural inputs. Zoom provides a convenient interface to implement remote interactions between the human and the robot. The integration of these components allows Robi Butler to ground remote multimodal instructions in real-world home environments in a zero-shot manner. We evaluated the system on various household tasks, demonstrating its ability to execute complex user commands with multimodal inputs. We also conducted a user study to examine how multimodal interaction influences user experiences in remote human-robot interaction. These results suggest that with the advances in robot foundation models, we are moving closer to the reality of remote household robot assistants.
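The idea of a multistep plan composed of primitives can be sketched as a simple dispatch loop. The skill names (`go_to`, `pick`, `answer`), their string-returning bodies, and the `execute_plan` helper are hypothetical stand-ins, not the system's API; in the described system the primitives would be backed by vision-language models and a motion planner, and the plan would come from the LLM-powered behavior module.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str]  # (primitive name, open-vocabulary argument)


def go_to(target: str) -> str:
    # Placeholder for a navigation primitive.
    return f"went to {target}"


def pick(target: str) -> str:
    # Placeholder for a manipulation primitive.
    return f"picked {target}"


def answer(question: str) -> str:
    # Placeholder for a visual question answering primitive.
    return f"answered: {question}"


SKILLS: Dict[str, Callable[[str], str]] = {
    "go_to": go_to,
    "pick": pick,
    "answer": answer,
}


def execute_plan(plan: List[Step]) -> List[str]:
    """Run each step in order, failing fast on an unknown primitive."""
    log = []
    for skill, argument in plan:
        if skill not in SKILLS:
            raise KeyError(f"unknown primitive: {skill}")
        log.append(SKILLS[skill](argument))
    return log


print(execute_plan([("go_to", "kitchen"), ("pick", "apple")]))
```

Keeping the primitives behind a name-to-function table means the planner only has to emit skill names and free-text arguments, which is one way to keep the planning and execution layers independent.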

Paper Structure (22 sections, 11 figures, 1 table)

Introduction
Related Work
Language and Gestures in Human-Robot Interaction
Household Robot Assistant
Overview
System Overview
Hardware Setup
System Implementation
Communication Interfaces
High-level Behavior Module
Primitive Skills
Manipulation
Navigation
Visual Question Answering
Experiments and Results
...and 7 more sections

Figures (11)

Figure 1: The Robi Butler system enables the user to Zoom-call the butler robot at home remotely and interact with it naturally through both language and hand gestures.
Figure 2: An overview of Robi Butler. The system consists of three components: Communication Interfaces, High-level Behavior Module, and Primitive Skills. The Communication Interfaces transmit the inputs received from the remote user to the High-level Behavior Module, which composes Primitive Skills that interact with the environment to fulfill instructions or answer questions.
Figure 3: The framework of communication interfaces.
Figure 4: The framework of high-level behavior module.
Figure 5: Human-Robot Remote interactions via language and gestures.
...and 6 more figures
