FastUMI: A Scalable and Hardware-Independent Universal Manipulation Interface with Dataset

Zhaxizhuoma, Kehui Liu, Chuyue Guan, Zhongjie Jia, Ziniu Wu, Xin Liu, Tianyu Wang, Shuai Liang, Pengan Chen, Pingrui Zhang, Haoming Song, Delin Qu, Dong Wang, Zhigang Wang, Nieqing Cao, Yan Ding, Bin Zhao, Xuelong Li

TL;DR

FastUMI addresses the scarcity and cost of real-world robotic manipulation data with a hardware-decoupled, plug-and-play data collection system that pairs a handheld demonstration device with a matching robot-mounted device for policy execution. It replaces the original UMI VIO pipeline with RealSense T265 tracking, provides a software pipeline for data collection and verification, and releases an open dataset of more than 10,000 demonstrations spanning 22 tasks for imitation learning. The authors also introduce algorithmic adaptations (Smooth-ACT and PoseACT for first-person perspectives, and Depth-Enhanced DP) along with a dynamic error-compensation mechanism that maintains alignment across diverse hardware. They report robust performance across varied manipulation scenarios, with significant improvements on depth-sensitive tasks and potential for broad cross-platform transfer. The open dataset and modular framework are intended to advance data-driven robotic learning in diverse real-world environments.

Abstract

Real-world manipulation data involving robotic arms is crucial for developing generalist action policies, yet such data remains scarce since existing data collection methods are hindered by high costs, hardware dependencies, and complex setup requirements. In this work, we introduce FastUMI, a substantial redesign of the Universal Manipulation Interface (UMI) system that addresses these challenges by enabling rapid deployment, simplifying hardware-software integration, and delivering robust performance in real-world data acquisition. Compared with UMI, FastUMI has several advantages: 1) It adopts a decoupled hardware design and incorporates extensive mechanical modifications, removing dependencies on specialized robotic components while preserving consistent observation perspectives. 2) It also refines the algorithmic pipeline by replacing complex Visual-Inertial Odometry (VIO) implementations with an off-the-shelf tracking module, significantly reducing deployment complexity while maintaining accuracy. 3) FastUMI includes an ecosystem for data collection, verification, and integration with both established and newly developed imitation learning algorithms, accelerating policy learning advancement. Additionally, we have open-sourced a high-quality dataset of over 10,000 real-world demonstration trajectories spanning 22 everyday tasks, forming one of the most diverse UMI-like datasets to date. Experimental results confirm that FastUMI facilitates rapid deployment, reduces operational costs and labor demands, and maintains robust performance across diverse manipulation scenarios, thereby advancing scalable data-driven robotic learning.

Paper Structure

This paper contains 33 sections, 9 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Physical prototypes of FastUMI. Left: The handheld device, used to collect demonstration data from human operators, includes a GoPro② for visual feedback, a RealSense T265① for end-effector pose tracking, fingertip markers④⑤ to measure the gripper aperture, and a top cover③ to secure both the GoPro and T265. Middle: The robot-mounted device, used for executing learned policies on the robotic arm, mirrors the handheld configuration. It features an ISO-standard-compatible camera mounting solution (including a GoPro mount⑥, extension arms⑦⑧, and a flange plate⑨) that adapts to varying arm and gripper geometries. This design maintains consistent GoPro perspectives across different setups, enabling direct transfer of human demonstration views to autonomous robotic executions. Right: FastUMI can be easily deployed on various robotic arms and grippers. To distinguish FastUMI's hardware configuration from that of the original UMI, we employ a color-coding scheme.
  • Figure 3: Visual alignment between the handheld device (Left) and the robot-mounted device (Right). The two views demonstrate the consistent positioning of the GoPro's fisheye lens image, with the bottom of the gripper's fingertips aligned to the red dashed lines.
  • Figure 4: Our plug-in fingertip design integrated with the xArm Gripper. The effective length of the xArm Gripper changes by approximately 1 centimeter between the fully closed and fully open positions, potentially causing misalignment when transferring demonstrations.
  • Figure 5: Left: The blue 3D-printed groove on the table, serving as a clear visual reference to aid loop closure. Right: The T265's trajectory in RVIZ, illustrating alignment with the initial reference, highlighted as a green dashed box, after revisiting the blue groove.
  • Figure 6: Illustration of the offset $\Delta_{c2g}$ from the T265 center to the gripper center, and the gripper center pose $\bigl(\mathbf{p}_{b2g}, \mathbf{R}_{b2g}\bigr)$ in the robot base frame.
  • ...and 6 more figures
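
The Figure 6 caption describes a fixed offset $\Delta_{c2g}$ from the T265 center to the gripper center, and a gripper pose $(\mathbf{p}_{b2g}, \mathbf{R}_{b2g})$ expressed in the robot base frame. The following is a minimal sketch of how such an offset could be applied, not the authors' implementation. It assumes that the T265 pose in the base frame is available as a position and a 3x3 rotation matrix (named `p_b2c` and `R_b2c` here, both hypothetical names), that the offset is a pure translation expressed in the T265 frame, and that the gripper shares the T265's orientation.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]


def mat_vec(R: Mat3, v: Vec3) -> Vec3:
    """Multiply a 3x3 rotation matrix by a 3-vector."""
    return tuple(sum(R[i][k] * v[k] for k in range(3)) for i in range(3))


def gripper_pose_from_tracker(
    p_b2c: Vec3, R_b2c: Mat3, delta_c2g: Vec3
) -> Tuple[Vec3, Mat3]:
    """Map a tracker pose in the base frame to a gripper-center pose.

    Assumptions (not taken from the paper): delta_c2g is a translation
    expressed in the tracker frame, and there is no rotational offset
    between the tracker and the gripper.
    """
    # Rotate the offset into the base frame, then add it to the tracker position.
    offset_in_base = mat_vec(R_b2c, delta_c2g)
    p_b2g = tuple(p + o for p, o in zip(p_b2c, offset_in_base))
    # Assumed: the gripper orientation equals the tracker orientation.
    R_b2g = R_b2c
    return p_b2g, R_b2g


def rot_z(theta: float) -> Mat3:
    """Rotation by theta radians about the z axis (helper for the example)."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))


# Illustrative values only: tracker at (0.5, 0.0, 0.3) m, yawed 90 degrees,
# with the gripper center 0.1 m along the tracker's x axis.
p_b2g, R_b2g = gripper_pose_from_tracker(
    (0.5, 0.0, 0.3), rot_z(math.pi / 2), (0.1, 0.0, 0.0)
)
```

With these illustrative inputs, the 0.1 m offset along the tracker's x axis lands along the base frame's y axis, so `p_b2g` is approximately (0.5, 0.1, 0.3). If the real mounting includes a rotational offset, `R_b2g` would instead be `R_b2c` composed with that fixed rotation.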