HoMeR: Learning In-the-Wild Mobile Manipulation via Hybrid Imitation and Whole-Body Control

Priya Sundaresan, Rhea Malhotra, Phillip Miao, Jingyun Yang, Jimmy Wu, Hengyuan Hu, Rika Antonova, Francis Engelmann, Dorsa Sadigh, Jeannette Bohg

TL;DR

HoMeR tackles in-the-wild mobile manipulation by marrying a fast IK-based whole-body controller with a hybrid imitation-learning policy that alternates between absolute keypose actions for long-range movement and dense delta actions for fine-grained manipulation in $SE(3)$. The approach learns from a small set of demonstrations and optionally leverages vision-language model (VLM) keypoints to ground goals, enabling generalization to novel objects and clutter. Empirically, HoMeR achieves an overall success rate of 79.17% across six tasks with only 20 demonstrations per task, outperforming non-hybrid baselines by substantial margins and demonstrating robust real-world performance. The modular architecture, including VLM conditioning and a diffusion-based dense policy, provides a scalable path toward deployable, generalizable assistive mobile manipulators in real homes.

Abstract

We introduce HoMeR, an imitation learning framework for mobile manipulation that combines whole-body control with hybrid action modes that handle both long-range and fine-grained motion, enabling effective performance on realistic in-the-wild tasks. At its core is a fast, kinematics-based whole-body controller that maps desired end-effector poses to coordinated motion across the mobile base and arm. Within this reduced end-effector action space, HoMeR learns to switch between absolute pose predictions for long-range movement and relative pose predictions for fine-grained manipulation, offloading low-level coordination to the controller and focusing learning on task-level decisions. We deploy HoMeR on a holonomic mobile manipulator with a 7-DoF arm in a real home. We compare HoMeR to baselines without hybrid actions or whole-body control across 3 simulated and 3 real household tasks such as opening cabinets, sweeping trash, and rearranging pillows. Across tasks, HoMeR achieves an overall success rate of 79.17% using just 20 demonstrations per task, outperforming the next best baseline by 29.17 percentage points on average. HoMeR is also compatible with vision-language models and can leverage their internet-scale priors to better generalize to novel object appearances, layouts, and cluttered scenes. In summary, HoMeR moves beyond tabletop settings and demonstrates a scalable path toward sample-efficient, generalizable manipulation in everyday indoor spaces. Code, videos, and supplementary material are available at: http://homer-manip.github.io
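
The switch between absolute and relative pose predictions can be illustrated with a minimal sketch. It is not the authors' implementation: the names `KEYPOSE`, `DENSE`, `make_pose`, `next_target`, and `rollout` are hypothetical, the list of `(mode, action)` pairs stands in for the outputs of the learned policies, relative actions are assumed to be expressed in the end-effector frame, and the controller is assumed to reach every target exactly.

```python
import numpy as np

KEYPOSE, DENSE = "keypose", "dense"  # hypothetical mode labels


def make_pose(yaw_rad, translation):
    """Build a 4x4 homogeneous transform: rotation about z plus a translation."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    pose = np.eye(4)
    pose[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    pose[:3, 3] = translation
    return pose


def next_target(current_pose, action, mode):
    """Convert a policy action into an end-effector target pose."""
    if mode == KEYPOSE:
        # Absolute action: the prediction is itself the goal pose.
        return action.copy()
    if mode == DENSE:
        # Relative action: a small delta composed onto the current pose
        # (end-effector-frame convention is an assumption of this sketch).
        return current_pose @ action
    raise ValueError(f"unknown mode: {mode}")


def rollout(initial_pose, steps):
    """Apply (mode, action) pairs in order, assuming perfect tracking."""
    pose = initial_pose
    for mode, action in steps:
        pose = next_target(pose, action, mode)
    return pose


# One long-range keypose, then one fine-grained delta along the gripper's x-axis.
steps = [
    (KEYPOSE, make_pose(np.pi / 2, [1.0, 0.0, 0.0])),
    (DENSE, make_pose(0.0, [0.1, 0.0, 0.0])),
]
final_pose = rollout(np.eye(4), steps)
```

Because the keypose rotates the gripper by 90 degrees about z, the subsequent delta along the gripper's x-axis moves the end-effector along the world y-axis. This is the practical difference between the two modes: a keypose fixes where the gripper is in the world, while a delta is interpreted relative to wherever the gripper currently is.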

Paper Structure

This paper contains 30 sections, 4 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: HoMeR. Left: A demonstrator uses whole-body iPhone teleoperation to collect data with a mobile manipulator in a real home. Right: From these collected demonstrations, HoMeR learns a hybrid imitation learning policy that switches between absolute actions for reaching, and relative actions for fine manipulation. A whole-body controller maps these end-effector commands to arm and base joint commands for execution.
  • Figure 2: HoMeR Policy Architecture: HoMeR consists of a dense policy that uses RGB images to predict relative actions for fine-grained manipulation, and a keypose policy that uses point clouds to predict absolute end-effector poses for long-range motion. Each policy also predicts the next control mode, enabling learned transitions. Optionally, the keypose policy can be conditioned on externally provided salient points derived from a VLM to support dynamic goal specification (HoMeR-Cond). Finally, a whole-body controller (WBC) converts predicted end-effector actions into joint commands for the mobile base and arm.
  • Figure 3: Hardware: We use the TidyBot++ holonomic mobile manipulator (Wu et al., 2024) with two base cameras and a wrist-mounted fisheye camera. An onboard NUC handles real-time control, and an onboard GPU laptop runs policy inference.
  • Figure 4: Benchmarking Results. We evaluate HoMeR on six simulated and real-world tasks (top) that require spatial generalization, precision, and long-horizon reasoning. TV Remote and Sweep Trash are particularly challenging due to their multi-step nature. HoMeR consistently outperforms baselines that use only dense actions or decoupled base-arm control, highlighting the benefits of hybrid action modes and whole-body coordination. The performance of all methods is best understood through the videos available at https://homer-manip.github.io/#benchmark.
  • Figure 5: Generalization Results. HoMeR-Cond achieves strong generalization to unseen scenarios by combining salient point conditioning with point cloud augmentations (videos at https://homer-manip.github.io/#generalization). Without augmentations (HoMeR-Cond-NoAugs) or conditioning (HoMeR), performance drops with distractors or novel appearances.
  • ...and 5 more figures
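
The whole-body controller mentioned in the abstract and in the captions of Figures 1 and 2 maps an end-effector target to coordinated base and arm motion. The sketch below shows that idea on a toy planar robot using damped least-squares inverse kinematics. It is an illustrative stand-in, not the controller used in the paper: the two-link arm, the link lengths `L1` and `L2`, and the functions `forward_kinematics`, `jacobian`, and `solve_whole_body_ik` are all assumptions made for this example.

```python
import numpy as np

L1, L2 = 0.4, 0.3  # hypothetical arm link lengths in meters


def forward_kinematics(q):
    """End-effector (x, y) for q = [base_x, base_y, base_yaw, joint1, joint2]."""
    x, y, yaw, j1, j2 = q
    a1 = yaw + j1   # absolute angle of the first link
    a12 = a1 + j2   # absolute angle of the second link
    return np.array([
        x + L1 * np.cos(a1) + L2 * np.cos(a12),
        y + L1 * np.sin(a1) + L2 * np.sin(a12),
    ])


def jacobian(q):
    """2x5 Jacobian of the end-effector position w.r.t. base and arm joints."""
    _, _, yaw, j1, j2 = q
    a1 = yaw + j1
    a12 = a1 + j2
    d_a1 = np.array([-L1 * np.sin(a1) - L2 * np.sin(a12),
                     L1 * np.cos(a1) + L2 * np.cos(a12)])
    d_a12 = np.array([-L2 * np.sin(a12), L2 * np.cos(a12)])
    # Base x and y translate the end-effector directly; base yaw and
    # joint1 rotate about the same point in this toy model.
    return np.column_stack([[1.0, 0.0], [0.0, 1.0], d_a1, d_a1, d_a12])


def solve_whole_body_ik(q0, target, damping=0.05, gain=0.5,
                        max_step=0.2, iters=300, tol=1e-4):
    """Iterate damped least-squares updates over base and arm jointly."""
    q = np.asarray(q0, dtype=float).copy()
    for _ in range(iters):
        err = target - forward_kinematics(q)
        dist = np.linalg.norm(err)
        if dist < tol:
            break
        if dist > max_step:
            err = err * (max_step / dist)  # limit the per-iteration step
        J = jacobian(q)
        # dq = J^T (J J^T + lambda^2 I)^{-1} (gain * err)
        dq = J.T @ np.linalg.solve(J @ J.T + damping**2 * np.eye(2), gain * err)
        q += dq
    return q


target = np.array([3.0, 2.0])  # beyond the arm's 0.7 m reach from the origin
q_solution = solve_whole_body_ik(np.zeros(5), target)
```

The target lies outside the arm's reach from the starting position, so the solver has to move the base as well as the arm. This mirrors why a whole-body formulation lets a policy command only end-effector poses and leave base-arm coordination to the controller.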