UMI on Legs: Making Manipulation Policies Mobile with Manipulation-Centric Whole-body Controllers

Huy Ha; Yihuai Gao; Zipeng Fu; Jie Tan; Shuran Song

TL;DR

UMI-on-Legs presents a scalable framework for mobile manipulation on quadrupeds by pairing real-world demonstrations collected with a hand-held gripper and simulation-trained whole-body controllers (WBCs) that track end-effector trajectories in a task frame. The key idea is an embodiment-agnostic interface: a diffusion-based manipulation policy proposes end-effector targets, which a high-frequency WBC then tracks on the robot. Across dynamic tossing, pushing, and cross-embodiment cup rearrangement, the approach reports over 70% success rate on all evaluated tasks and demonstrates zero-shot transfer of a policy trained for a fixed-base arm to a legged platform. The work highlights a practical path to port expressive manipulation skills to mobile, dynamic robots by decoupling task-space planning from embodiment-specific control and by collecting real-world data with lightweight, accessible sensing.

Abstract

We introduce UMI-on-Legs, a new framework that combines real-world and simulation data for quadruped manipulation systems. We scale task-centric data collection in the real world using a hand-held gripper (UMI), providing a cheap way to demonstrate task-relevant manipulation skills without a robot. Simultaneously, we scale robot-centric data in simulation by training a whole-body controller for task tracking without task simulation setups. The interface between these two policies is end-effector trajectories in the task frame, inferred by the manipulation policy and passed to the whole-body controller for tracking. We evaluate UMI-on-Legs on prehensile, non-prehensile, and dynamic manipulation tasks, and report over 70% success rate on all tasks. Lastly, we demonstrate the zero-shot cross-embodiment deployment of a pre-trained manipulation policy checkpoint from prior work, originally intended for a fixed-base robot arm, on our quadruped system. We believe this framework provides a scalable path towards learning expressive manipulation skills on dynamic robot embodiments. Please check out our website for robot videos, code, and data: https://umi-on-legs.github.io
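
To make the task-frame interface concrete, here is a minimal sketch of how a camera-frame end-effector trajectory might be re-expressed in the task frame. It assumes poses are stored as 4x4 homogeneous transforms and that an odometry estimate of the camera pose in the task frame is available. All names (`matmul4`, `to_task_frame`, `translation`, `T_task_camera`) are hypothetical and are not taken from the authors' code.

```python
from typing import List

Matrix = List[List[float]]  # 4x4 homogeneous transform, row-major


def matmul4(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def translation(x: float, y: float, z: float) -> Matrix:
    """Build a pure-translation transform (identity rotation)."""
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


def to_task_frame(T_task_camera: Matrix, traj_camera: List[Matrix]) -> List[Matrix]:
    """Re-express each camera-frame end-effector pose in the task frame."""
    return [matmul4(T_task_camera, T_camera_ee) for T_camera_ee in traj_camera]


# Assumed example values: the camera sits 1 m along x in the task frame,
# and the policy predicts a target 0.5 m along the camera's z axis.
T_task_camera = translation(1.0, 0.0, 0.0)
traj_camera = [translation(0.0, 0.0, 0.5)]
traj_task = to_task_frame(T_task_camera, traj_camera)
```

Because the targets are expressed in the task frame, they stay fixed in the world when the robot base moves, which is what lets a separate controller handle tracking.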

Paper Structure (35 sections, 7 figures, 7 tables)

Introduction
Related Work
Method: Universal Manipulation Interface on Legged Robots
Manipulation Policy with Behavior Cloning
Whole-body Controller with Reinforcement Learning
System Integration
Experiments
Capability: Whole-Body Dynamic Tossing
Robustness: End-effector Reaching Leads to Robust Whole-body Pushing
Scalability: Plug-and-play Cross-Embodiment Manipulation Policies
Limitations
Conclusion
Acknowledgements
Things that did not work
Privileged policy distillation and observation history.
...and 20 more sections

Figures (7)

Figure 1: UMI-on-Legs. We achieve fully autonomous, expressive, real-world manipulation skills on quadrupeds by combining real-world demonstrations using hand-held grippers (left) and simulation-trained whole-body controllers (right). Our framework allows the porting of existing "table-top" manipulation policies to mobile manipulation while enhancing mobility and power from the quadruped's legs.
Figure 2: In-the-wild Manipulation, on Legs. Featuring robust, low-latency iPhone odometry and onboard power/compute, our mobile manipulation system is complete for in-the-wild mobile manipulation.
Figure 3: Method Overview. Our system takes as input RGB images from a GoPro and infers a camera-frame end-effector trajectory using a diffusion policy (a), trained using real-world UMI demonstrations. We transform this trajectory into the task space and use it as the interface to the WBC. This controller (c) outputs joint position targets at 50 Hz, which PD controllers subsequently track.
Figure 4: Task- vs. body-frame tracking. Our whole-body controller (WBC) learns to track the target trajectory in the task frame (a), effectively compensating for base perturbations and thereby freeing the manipulation policy to focus on making task progress. In contrast, most existing WBCs use body-frame tracking (b) [fu2023deep, liu2024visual, pan2024roboduet, portela2024learning] and are trained to follow base perturbations. In effect, they defer task-space tracking responsibilities to the low-rate manipulation policy and fail to react quickly to body perturbations.
Figure 5: Dynamic tossing requires dynamic whole-body coordination. Our controller (top row) discovers a three-stage strategy to toss reliably given its limited arm strength and body inertia. First, as the arm accelerates forward, the back legs pop up, leading to a leap-and-toss motion (a, green). To prevent falling forwards, the robot tucks and curls its arms and legs inwards (b, orange), inducing a backwards torque. This backwards torque helps shift the front legs' contact points forwards, in front of its center of mass, enabling a soft landing. In contrast, the controller with no preview information (bottom row) leaps in an attempt to follow the fast forward acceleration of the target but does not know where the target will go next, and thus drops the ball. Please check out https://umi-on-legs.github.io/ for robot videos!
...and 2 more figures
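
The Figure 3 caption notes that the WBC emits joint position targets which PD controllers then track. As an illustration only, a textbook PD law for that last step might look like the sketch below. The function name `pd_torques` and the gains `kp` and `kd` are hypothetical values chosen for the example, not the gains or code used in the paper.

```python
from typing import List, Sequence


def pd_torques(q_target: Sequence[float], q: Sequence[float],
               qd: Sequence[float], kp: float, kd: float) -> List[float]:
    """Textbook PD law per joint: tau = kp * (q_target - q) - kd * qd."""
    return [kp * (qt - qi) - kd * qdi
            for qt, qi, qdi in zip(q_target, q, qd)]


# Assumed example: two joints, with illustrative (not paper-reported) gains.
torques = pd_torques(q_target=[0.5, -0.2], q=[0.4, -0.2],
                     qd=[0.0, 1.0], kp=20.0, kd=0.5)
```

The proportional term pulls each joint towards its target, while the derivative term damps joint velocity. In the example, the second joint is already at its target, so only the damping term contributes.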
