BifrostUMI: Bridging Robot-Free Demonstrations and Humanoid Whole-Body Manipulation

Chenhao Yu, Hongwu Wang, Youhao Hu, Jiachen Zhang, Yuanyuan Li, Shaqi Luo

Abstract

High-quality data collection is a cornerstone of training humanoid whole-body visuomotor policies. Current data acquisition paradigms predominantly rely on robot teleoperation, which is often hindered by limited hardware accessibility and low operational efficiency. Inspired by the Universal Manipulation Interface (UMI), we propose BifrostUMI, a portable, efficient, and robot-free data collection framework tailored for humanoid robots. BifrostUMI leverages lightweight VR devices to capture human demonstrations as sparse keypoint trajectories while simultaneously recording wrist-mounted visual data. These multimodal data are subsequently used to train a high-level policy network that predicts future keypoint trajectories conditioned on the captured visual features. Through a robust keypoint retargeting pipeline, keypoint trajectories are precisely mapped onto the robot's morphology and executed via a whole-body controller. This approach enables the seamless transfer of diverse and agile behaviors from natural human demonstrations to humanoid embodiments. We demonstrate the efficacy and versatility of the proposed framework across two distinct experimental scenarios.

Figures (5)

  • Figure 2: BifrostUMI Data Acquisition System. The data acquisition platform consists of a Pico4-based motion capture setup, including two foot-mounted trackers and one waist-mounted tracker, together with two instrumented grippers, each equipped with a fisheye camera. The system synchronously records multimodal observations: wrist-view images from the fisheye cameras, human keypoint states obtained via the Pico SDK, and gripper aperture measurements derived from motor encoder readings. These heterogeneous data streams are jointly used to train a high-level policy, which is subsequently deployed for real-time control of robot motion. (A sketch of one synchronized record follows this list.)
  • Figure 3: BifrostUMI Hierarchical Visuomotor Control. BifrostUMI formulates humanoid visuomotor control as a three-stage hierarchy. A diffusion-based high-level policy infers task-space keypoint trajectories and gripper commands from wrist-view images and partial proprioception. The spatial keypoint retargeting bridge maps these commands to a 36-dimensional robot-native motion representation comprising the root pose and joint configurations. A low-level whole-body controller then tracks the retargeted motion using proprioceptive feedback, enabling stable humanoid execution from robot-free demonstrations. (A code sketch of this control loop follows this list.)
  • Figure 4: Conditional diffusion policy architecture. The left and right wrist-view RGB images are encoded by DINOv2 and fused with lower-body DoF states and the diffusion step into a global condition. Conditioned on this representation, the diffusion model predicts action trajectories for the left/right TCPs and body-support keypoints. (A sketch of the condition fusion follows this list.)
  • Figure 5: Spatial Keypoint Retargeting (SKR). SKR bridges high-level keypoint prediction and low-level whole-body control by converting five task-space keypoints, including the pelvis, two TCPs, and two feet, into robot-native whole-body references. Unlike global motion rescaling, SKR preserves metric spatial relationships among the keypoints and only scales the vertical pelvis-to-foot distance to compensate for human-robot height differences. The resulting inverse-kinematics solution provides executable joint-level motion commands for the humanoid robot. (A sketch of the vertical rescaling follows this list.)
  • Figure 6: Real-world evaluation of BifrostUMI on two humanoid manipulation tasks with a Unitree G1 robot. (a) Cluttered tabletop pick-and-place: the robot localizes, grasps, transfers, and places a piece of bread onto a target plate, demonstrating end-to-end transfer from robot-free VR-UMI demonstrations to physical humanoid execution. (b) Whole-body under-table waste disposal: the robot grasps a crumpled paper ball, steps backward, bends its knees and torso, and releases the object into a waste bin, demonstrating coordinated whole-body manipulation across the hands, waist, and legs. Numbered frames indicate the execution sequence, and red dashed circles highlight the task-relevant manipulation regions.
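
To make the Figure 2 data streams concrete, below is a minimal sketch of one synchronized demonstration record in Python/NumPy. Every field name, shape, and unit here is an illustrative assumption; the paper does not specify a data schema.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class DemoFrame:
    """One synchronized sample from a hypothetical BifrostUMI recorder.

    Fields mirror the streams named in Figure 2; shapes are assumptions.
    """
    t: float                      # shared capture timestamp (s)
    left_rgb: np.ndarray          # (H, W, 3) left wrist fisheye image
    right_rgb: np.ndarray         # (H, W, 3) right wrist fisheye image
    keypoints: np.ndarray         # (5, 7) pelvis, L/R TCP, L/R foot poses
                                  #        as position (xyz) + quaternion
    gripper_aperture: np.ndarray  # (2,) left/right aperture from encoders (m)
```

A single shared timestamp per frame reflects the synchronous recording emphasized in the caption; in practice the camera, Pico SDK, and encoder streams would each need to be aligned to that clock.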
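The three-stage hierarchy in Figure 3 can be read as a single control cycle. The sketch below assumes hypothetical `policy`, `retargeter`, and `wbc` objects with the stated interfaces; none of these names come from the paper.

```python
def control_step(policy, retargeter, wbc, left_img, right_img, proprio):
    """One cycle of the Figure 3 hierarchy (interfaces assumed)."""
    # Stage 1: the diffusion-based high-level policy predicts task-space
    # keypoint trajectories and gripper commands from the two wrist views
    # and partial proprioception.
    keypoint_traj, gripper_cmd = policy.predict(left_img, right_img, proprio)

    # Stage 2: spatial keypoint retargeting maps the five task-space
    # keypoints to the 36-dimensional robot-native reference
    # (root pose plus joint configurations).
    motion_ref = retargeter.retarget(keypoint_traj)

    # Stage 3: the low-level whole-body controller tracks the retargeted
    # reference using proprioceptive feedback.
    joint_cmd = wbc.track(motion_ref, proprio)
    return joint_cmd, gripper_cmd
```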
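The global conditioning described in Figure 4 might be fused as below. This is a sketch, not the authors' architecture: the 384-dimensional image feature matches a small DINOv2 variant, the 12 lower-body DoFs and all layer sizes are guesses, and the linear step embedding stands in for whatever diffusion-step encoding the paper uses.

```python
import torch
import torch.nn as nn


class GlobalCondition(nn.Module):
    """Fuse wrist-view features, lower-body DoF state, and the diffusion
    step into a single conditioning vector (dimensions are assumptions)."""

    def __init__(self, img_dim=384, dof_dim=12, step_dim=64, out_dim=512):
        super().__init__()
        self.step_embed = nn.Sequential(nn.Linear(1, step_dim), nn.SiLU())
        self.fuse = nn.Linear(2 * img_dim + dof_dim + step_dim, out_dim)

    def forward(self, left_feat, right_feat, dof_state, k):
        # left_feat, right_feat: (B, img_dim) DINOv2 embeddings per wrist view
        # dof_state: (B, dof_dim) lower-body joint positions
        # k: (B, 1) diffusion step index
        parts = [left_feat, right_feat, dof_state, self.step_embed(k.float())]
        return self.fuse(torch.cat(parts, dim=-1))
```

The denoising network would consume this vector at every diffusion step while iteratively refining the predicted TCP and body-support trajectories.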
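Finally, the height compensation in Figure 5 reduces to adjusting one vertical offset. In the sketch below, the keypoint ordering (pelvis first, feet last) and the use of leg lengths as the scale factor are assumptions; only the pelvis-to-foot vertical distance is rescaled, matching the caption's claim that all other metric relations are preserved.

```python
import numpy as np


def skr_height_compensation(keypoints_xyz, human_leg_len, robot_leg_len):
    """Rescale only the vertical pelvis-to-foot distance (Figure 5).

    keypoints_xyz: (5, 3) positions ordered as pelvis, left TCP,
    right TCP, left foot, right foot (ordering assumed).
    """
    kp = keypoints_xyz.copy()
    foot_z = kp[3:, 2].mean()  # mean height of the two feet
    # Move the pelvis toward/away from the feet by the leg-length ratio;
    # horizontal coordinates and TCP positions keep their metric values.
    kp[0, 2] = foot_z + (kp[0, 2] - foot_z) * (robot_leg_len / human_leg_len)
    return kp
```

Per the caption, an inverse-kinematics solve over the adjusted keypoints then yields the executable joint-level commands for the humanoid.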