EMMA: Scaling Mobile Manipulation via Egocentric Human Data

Lawrence Y. Zhu, Pranav Kuppili, Ryan Punamiya, Patcharapong Aphiwetsa, Dhruv Patel, Simar Kareer, Sehoon Ha, Danfei Xu

TL;DR

EMMA tackles the data bottleneck in mobile manipulation by learning from egocentric human demonstrations complemented with static robot data, bypassing mobile teleoperation. It introduces a data retargeting step, a unified decoder-transformer architecture for cross-embodiment co-training, and an unsupervised phase-identification module that switches between navigation and manipulation. Across three real-world tasks, EMMA matches or surpasses teleoperation-based baselines, generalizes to unseen environments, and shows favorable scaling with more human data. This work suggests a scalable data paradigm for mobile manipulation in real-world environments.

Abstract

Scaling mobile manipulation imitation learning is bottlenecked by expensive mobile robot teleoperation. We present Egocentric Mobile MAnipulation (EMMA), an end-to-end framework that trains mobile manipulation policies from human mobile manipulation data and static robot data, sidestepping mobile teleoperation. To accomplish this, we co-train human full-body motion data with static robot data. In our experiments across three real-world tasks, EMMA performs comparably to baselines trained on teleoperated mobile robot data (Mobile ALOHA), achieving equal or higher full-task success rates. We find that EMMA generalizes to new spatial configurations and scenes, and we observe positive performance scaling as we increase the hours of human data, opening new avenues for scalable robotic learning in real-world environments. Details of this project can be found at https://ego-moma.github.io/.

Paper Structure

This paper contains 24 sections, 4 equations, 10 figures, 1 table, 1 algorithm.

Figures (10)

  • Figure 1: EMMA learns mobile manipulation policies without collecting mobile manipulation teleoperation data. We achieve this through bridging embodiment kinematic gaps and unified co-training of mobile human data and static robot data.
  • Figure 2: Left: Architecture of joint human-robot policy learning framework, built on top of wang2024hpt. Our model processes heterogeneous human and robot data through stems and decodes them through various action heads. The navigation head is deployed on the robot during evaluation, demonstrating transfer without robot supervision. Right: Our custom low-cost bimanual mobile manipulator.
  • Figure 3: Given the ground-plane projection of the human head trajectory, we optimize Eq. \ref{eq:retargeting} to produce a smooth and executable path for our differential-drive robot that can be run directly and used as input for policy learning.
  • Figure 4: Cumulative success rates across subtasks for three mobile manipulation tasks. EMMA (blue), trained without mobile teleoperation data, significantly outperforms Mobile ALOHA (orange) on Grocery Shopping and Handover Wine tasks ($p < 0.05$). Table Service variants show comparable performance. Error bars represent 95% Clopper–Pearson confidence intervals with N = 50 trials.
  • Figure 5: (a) For the Handover Wine task, starting with a fixed amount of static manipulation data, we show that adding more human full-body motion data for EMMA (blue) yields greater performance gains than adding mobile robot teleoperation data collected in an equivalent amount of time for Mobile ALOHA (orange). The performance gap expands from 10% to 30% as data increases from 15 to 60 minutes. (b) EMMA generalizes to an unseen scene with a 54% full task success rate.
  • ...and 5 more figures
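
The Figure 4 caption reports 95% Clopper–Pearson confidence intervals over N = 50 trials. As a rough illustration of how such an exact binomial interval can be computed, here is a minimal standard-library sketch. The function names (`binom_cdf`, `clopper_pearson`) and the success count in the usage line are hypothetical and chosen for illustration only; they are not taken from the paper or its code.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _root_of_increasing(g, iters: int = 100) -> float:
    """Bisection on [0, 1] for a function g that increases from <0 to >0."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact (Clopper-Pearson) two-sided interval for k successes in n trials."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    # Lower bound: the p at which P(X >= k) equals alpha/2 (0 when k == 0).
    lower = 0.0 if k == 0 else _root_of_increasing(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2
    )
    # Upper bound: the p at which P(X <= k) equals alpha/2 (1 when k == n).
    upper = 1.0 if k == n else _root_of_increasing(
        lambda p: alpha / 2 - binom_cdf(k, n, p)
    )
    return lower, upper


if __name__ == "__main__":
    # Hypothetical count for illustration: 40 successes out of 50 trials.
    print(clopper_pearson(40, 50))
```

Bisection on the binomial tail keeps the sketch dependency-free; the same bounds are usually expressed as quantiles of a Beta distribution when a statistics library is available.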