UMIGen: A Unified Framework for Egocentric Point Cloud Generation and Cross-Embodiment Robotic Imitation Learning

Yan Huang, Shoujie Li, Xingting Li, Wenbo Ding

TL;DR

UMIGen addresses the data bottleneck in robotic imitation learning by enabling fast, low-cost collection of egocentric 3D observations and actions with a handheld Cloud-UMI device. It introduces visibility-aware optimization (VAO), which synthesizes only the points that fall within the wrist camera's field of view so that generated data matches what the camera can actually perceive. The approach demonstrates strong cross-embodiment generalization and rapid data generation both in simulation and on real robots. This work reduces hardware and data-collection costs while enabling scalable, transferable visuomotor policies across embodiments.

Abstract

Data-driven robotic learning faces an obvious dilemma: robust policies demand large-scale, high-quality demonstration data, yet collecting such data remains a major challenge owing to high operational costs, dependence on specialized hardware, and the limited spatial generalization capability of current methods. The Universal Manipulation Interface (UMI) relaxes the strict hardware requirements for data collection, but it is restricted to capturing only RGB images of a scene and omits the 3D geometric information on which many tasks rely. Inspired by DemoGen, we propose UMIGen, a unified framework that consists of two key components: (1) Cloud-UMI, a handheld data collection device that requires no visual SLAM and simultaneously records point cloud observation-action pairs; and (2) a visibility-aware optimization mechanism that extends the DemoGen pipeline to egocentric 3D observations by generating only points within the camera's field of view. These two components enable efficient data generation that aligns with real egocentric observations and can be directly transferred across different robot embodiments without any post-processing. Experiments in both simulated and real-world settings demonstrate that UMIGen supports strong cross-embodiment generalization and accelerates data collection in diverse manipulation tasks.
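The idea of keeping only points within the camera's field of view can be illustrated with a simple frustum test. The following is a minimal sketch, not the authors' visibility-aware optimization: it assumes a pinhole camera whose frame has z pointing forward, and the field-of-view angles, near/far limits, and the function names `in_frustum` and `crop_to_fov` are all hypothetical choices made for illustration. It also ignores occlusion by objects, which a full method would need to handle.

```python
import math


def in_frustum(point, h_fov_deg=70.0, v_fov_deg=55.0, near=0.05, far=1.5):
    """Return True if a camera-frame point lies inside the viewing frustum.

    Assumed convention (not from the paper): x right, y down, z forward.
    The FoV angles and depth limits are placeholder values.
    """
    x, y, z = point
    # Reject points behind the camera or outside the usable depth range.
    if not (near <= z <= far):
        return False
    # At depth z the frustum half-width is z * tan(fov / 2).
    half_w = z * math.tan(math.radians(h_fov_deg) / 2)
    half_h = z * math.tan(math.radians(v_fov_deg) / 2)
    return abs(x) <= half_w and abs(y) <= half_h


def crop_to_fov(points, **frustum_kwargs):
    """Keep only the points of a camera-frame cloud that the camera could see."""
    return [p for p in points if in_frustum(p, **frustum_kwargs)]


if __name__ == "__main__":
    cloud = [
        (0.0, 0.0, 0.5),   # straight ahead: kept
        (1.0, 0.0, 0.5),   # far off to the side: dropped
        (0.0, 0.0, -0.5),  # behind the camera: dropped
        (0.0, 0.0, 3.0),   # beyond the far limit: dropped
    ]
    print(crop_to_fov(cloud))
```

In a generation pipeline of this kind, such a test would be applied after a synthesized scene is expressed in the camera frame of the new wrist pose, so that the generated observation contains only geometry a real wrist camera could have captured.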

Paper Structure

This paper contains 6 sections, 10 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Overview of UMIGen. (a) From a few wrist-view demonstrations, UMIGen generates diverse samples that generalize spatially and support transfer across robot embodiments sharing the same wrist viewpoint. (b) During augmentation, only the points within the camera’s field of view are kept. This makes the generated observations realistic and consistent with what the wrist-mounted camera can actually see. (c) Cloud-UMI, a low-cost handheld data collection device that fuses a depth sensor with a tracking module, eliminating the need for complex visual-SLAM or external motion capture systems. (d) Experiments and applications using UMIGen. Curved arrows trace the end-effector trajectory.
  • Figure 2: Overview of the dataset collection and generation pipeline. (a) The collection of observation–action pairs, where orange arrows indicate the transformation of the point cloud from the camera coordinate frame to the robot base coordinate frame, while blue arrows mark the corresponding 6D action pose. (b) The motion stage plans actions that bridge adjacent manipulation segments. The point cloud is cropped to the camera-visible region and used as the generated observation. (c) The manipulation stage applies a transformation to all actions.
  • Figure 3: Illustration of two types of egocentric occlusions encountered during data collection. (Left) Object-Induced Occlusion: Large objects obstruct the camera's FoV, preventing visibility of surrounding workspace regions. (Right) Viewpoint-Induced Occlusion: The limited and task-dependent viewpoint of a wrist-mounted camera causes certain key elements of the task to fall outside the view during different stages of execution.
  • Figure 4: The simulation benchmark comprises five tasks (Lift, Stack, Square, Can, Close Drawer) and four robot arms (Panda, UR5e, Kinova 3, IIWA).
  • Figure 5: Overview of the experimental setup and spatial generalization configuration. (a) Real-world hardware platform used for the experimental tasks. (b) Workspace layout for spatial generalization evaluation. Red markers denote the locations used to evaluate generalization performance. (c) Visualization of the demonstration generation configuration. Green markers denote the locations of source demonstrations, and orange markers indicate candidate locations of generated demonstrations.
  • ...and 2 more figures
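
The camera-to-base transformation mentioned in the Figure 2 caption can be sketched with homogeneous transforms. This is an illustrative example under stated assumptions rather than the paper's implementation: the names `T_base_ee` (tracked end-effector pose in the base frame) and `T_ee_cam` (fixed camera mounting offset), the helper functions, and all numeric values are hypothetical.

```python
def translation(tx, ty, tz):
    """Build a 4x4 homogeneous transform with identity rotation."""
    return [
        [1.0, 0.0, 0.0, tx],
        [0.0, 1.0, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def matmul4(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


def transform_points(T, points):
    """Apply a 4x4 homogeneous transform T to a list of (x, y, z) points."""
    out = []
    for x, y, z in points:
        out.append(tuple(
            T[i][0] * x + T[i][1] * y + T[i][2] * z + T[i][3]
            for i in range(3)
        ))
    return out


if __name__ == "__main__":
    # Hypothetical poses: end effector 0.4 m forward and 0.3 m up from the
    # base, camera mounted 0.05 m along the end-effector z axis.
    T_base_ee = translation(0.4, 0.0, 0.3)
    T_ee_cam = translation(0.0, 0.0, 0.05)

    # Chain the transforms: camera frame -> end-effector frame -> base frame.
    T_base_cam = matmul4(T_base_ee, T_ee_cam)

    cloud_cam = [(0.0, 0.0, 0.2)]  # one point 0.2 m in front of the camera
    print(transform_points(T_base_cam, cloud_cam))
```

Expressing every frame's point cloud in the base frame gives observations and 6D action poses a shared reference, which is what allows a single transformation to be applied consistently to both during generation.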