HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations

Xiaomeng Xu; Jisang Park; Han Zhang; Eric Cousineau; Aditya Bhat; Jose Barreiros; Dian Wang; Shuran Song

TL;DR

This work presents the Whole-Body Mobile Manipulation Interface (HoMMI), a data collection and policy learning framework that learns whole-body mobile manipulation directly from robot-free human demonstrations and explicitly bridges the human-to-robot embodiment gap with a cross-embodiment hand-eye policy design.

Abstract

We present the Whole-Body Mobile Manipulation Interface (HoMMI), a data collection and policy learning framework that learns whole-body mobile manipulation directly from robot-free human demonstrations. We augment UMI interfaces with egocentric sensing to capture the global context required for mobile manipulation, enabling portable, robot-free, and scalable data collection. However, naively incorporating egocentric sensing widens the human-to-robot embodiment gap in both observation and action spaces, making policy transfer difficult. We explicitly bridge this gap with a cross-embodiment hand-eye policy design that comprises an embodiment-agnostic visual representation; a relaxed head action representation; and a whole-body controller that realizes hand-eye trajectories through coordinated whole-body motion under robot-specific physical constraints. Together, these components enable long-horizon mobile manipulation tasks that require bimanual and whole-body coordination, navigation, and active perception. Results are best viewed at: https://hommi-robot.github.io

Paper Structure (29 sections, 2 equations, 12 figures, 1 table)

Introduction
Related Work
Data Collection Interfaces for Robot Learning
Robot Learning from Egocentric Demonstrations
Learning Mobile Manipulation From Demonstrations
Design Objectives
HoMMI Data Collection Interface
Cross-embodiment Hand-Eye Policy
3D Visual Representation to Mitigate the Visual Gap
3D Look-at Point Action Representation to Mitigate the Kinematic Gap
Gripper-Centric Frame for Spatial Awareness
Robot System
Bimanual Mobile Manipulator Hardware Setup
Constraint-Aware Whole-body Controller
Asynchronous Policy Inference
...and 14 more sections

Figures (12)

Figure 1: Whole-Body Mobile Manipulation Interface (HoMMI). (a) We extend UMI with egocentric sensing to enable scalable mobile manipulation with active perception -- capabilities that cannot be achieved with the original UMI. (b) However, the new egocentric view creates a substantial embodiment gap in both observation and action space, making policy transfer difficult. (c) We bridge this embodiment gap by carefully redesigning the visual and action representations and integrating them with a constraint-aware whole-body controller. Together, HoMMI is able to learn diverse mobile manipulation skills directly from human demonstrations, without any robot teleoperation data.
Figure 2: HoMMI System Overview. We learn whole-body mobile manipulation from human demonstrations with an intuitive data collection interface (see "HoMMI Data Collection Interface"), a cross-embodiment policy design with an embodiment-agnostic visual representation and a relaxed head action representation (see "Cross-embodiment Hand-Eye Policy"), and a whole-body controller that achieves hand-eye tracking through whole-body motions respecting physical constraints (see "Constraint-Aware Whole-body Controller").
Figure 3: Embodiment-Agnostic Visual Representation. We use a 3D representation for egocentric observations, which allows the use of an embodiment-agnostic gripper coordinate frame and the masking of embodiment-specific arm and body observations.
Figure 4: Look-at Point Action Representation. To bridge the kinematic gap (e.g., height and neck DoF), we relax the head action constraint by representing the robot gaze as a "3D look-at point". This representation allows effective active perception for gathering task-relevant information without over-constraining the robot to mimic human head motions exactly.
Figure 5: The HoMMI Whole-Body Controller is designed to achieve precise end-effector tracking for accurate manipulation and effective active perception for information gathering. To do so, it uses (a) a relaxed head look-at point action representation that allows accurate SE(3) tracking of both end-effectors, circumventing the infeasibility and increased error associated with simultaneous 6-DoF head-hand tracking. We also apply (b) constraints and regularization to ensure stability and prevent the disastrous behaviors that would otherwise occur.
...and 7 more figures
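
The captions for Figures 3 and 4 describe two geometric ideas: expressing observations in a gripper coordinate frame, and specifying gaze as a 3D look-at point rather than a full head pose. The following is a minimal sketch of that geometry only, not the authors' implementation. The function names, the axis convention (x forward, y left, z up), and the reduction of gaze to a yaw/pitch pair are all assumptions made for illustration; the paper's controller realizes gaze through whole-body motion under constraints, which this sketch does not attempt.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def look_at_angles(head_pos: Vec3, look_at: Vec3) -> Tuple[float, float]:
    """Return (yaw, pitch) in radians pointing from head_pos toward look_at.

    Assumed convention (hypothetical, not from the paper): x forward,
    y left, z up; yaw rotates about z, pitch is elevation above the
    horizontal plane (negative when looking down).
    """
    dx = look_at[0] - head_pos[0]
    dy = look_at[1] - head_pos[1]
    dz = look_at[2] - head_pos[2]
    if dx == 0.0 and dy == 0.0 and dz == 0.0:
        raise ValueError("look-at point coincides with head position")
    yaw = math.atan2(dy, dx)
    pitch = math.atan2(dz, math.hypot(dx, dy))
    return yaw, pitch


def world_to_gripper(
    point: Vec3, grip_pos: Vec3, grip_rot: Sequence[Sequence[float]]
) -> Vec3:
    """Express a world-frame point in the gripper frame as R^T (p - t).

    grip_rot is a 3x3 rotation matrix (row-major nested sequences) whose
    columns are the gripper axes written in world coordinates.
    """
    d = [point[i] - grip_pos[i] for i in range(3)]
    # (R^T d)_i = sum_k R[k][i] * d[k]
    return tuple(sum(grip_rot[k][i] * d[k] for k in range(3)) for i in range(3))


if __name__ == "__main__":
    # Illustrative numbers only: the same look-at point seen from two
    # different head heights yields different head angles.
    target = (1.0, 0.0, 0.8)
    print("taller head:", look_at_angles((0.0, 0.0, 1.7), target))
    print("shorter head:", look_at_angles((0.0, 0.0, 1.2), target))
```

In this sketch, two heads at different heights share one look-at target but need different pitch angles to reach it, which is the sense in which a look-at point avoids forcing the robot to copy the human head pose. Similarly, a point expressed relative to the gripper does not depend on where the rest of the body is.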
