FastUMI-100K: Advancing Data-driven Robotic Manipulation with a Large-scale UMI-style Dataset

Kehui Liu, Zhongjie Jia, Yang Li, Zhaxizhuoma, Pengan Chen, Song Liu, Xin Liu, Pingrui Zhang, Haoming Song, Xinyi Ye, Nieqing Cao, Zhigang Wang, Jia Zeng, Dong Wang, Yan Ding, Bin Zhao, Xuelong Li

TL;DR

The paper introduces FastUMI-100K, a large-scale, UMI-style multimodal dataset for robotic manipulation, comprising 100K+ long-horizon trajectories across 54 tasks and hundreds of objects captured via a hardware-agnostic FastUMI system. It combines end-effector states, multi-view wrist images, and textual annotations to enable cross-embodiment transfer and downstream learning for single-task imitation, cross-platform deployment, and Vision-Language-Action (VLA) model fine-tuning. The authors detail a robust data collection pipeline, precise multi-sensor temporal alignment, and a two-tier annotation framework, achieving high-quality, diverse data suitable for training robust policies. Experimental results show high policy success across baseline models and successful cross-platform transfer with simple coordinate mapping, validating the dataset's robustness and practicality for real-world manipulation. Overall, FastUMI-100K stands as a valuable resource to accelerate data-driven robotic manipulation research with scalable, generalizable, and cross-embodiment data.

Abstract

Data-driven robotic manipulation learning depends on large-scale, high-quality expert demonstration datasets. However, existing datasets, which primarily rely on data collected by humans teleoperating robots, are limited in scalability, trajectory smoothness, and applicability across different robotic embodiments in real-world environments. In this paper, we present FastUMI-100K, a large-scale UMI-style multimodal demonstration dataset designed to overcome these limitations and meet the growing complexity of real-world manipulation tasks. Collected by FastUMI, a novel robotic system featuring a modular, hardware-decoupled mechanical design and an integrated lightweight tracking system, FastUMI-100K offers a more scalable, flexible, and adaptable solution to the diverse requirements of real-world robot demonstration data. Specifically, FastUMI-100K contains over 100K demonstration trajectories collected across representative household environments, covering 54 tasks and hundreds of object types. Our dataset integrates multimodal streams, including end-effector states, multi-view wrist-mounted fisheye images, and textual annotations. Each trajectory ranges from 120 to 500 frames in length. Experimental results demonstrate that FastUMI-100K enables high policy success rates across various baseline algorithms, confirming its robustness, adaptability, and real-world applicability for solving complex, dynamic manipulation challenges. The source code and dataset will be released at https://github.com/MrKeee/FastUMI-100K.

Paper Structure

This paper contains 24 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of FastUMI-100K. We introduce FastUMI-100K, a large-scale UMI-style dataset. This dataset comprises 100K+ long-horizon trajectories, spanning 54 distinct tasks and over 100 real-world manipulation objects. FastUMI-100K integrates multimodal data streams, including single-arm and dual-arm trajectories, multi-view wrist-mounted fisheye images, and fine-grained textual annotations. Benefiting from our embodiment-agnostic design, FastUMI-100K can be used across diverse robot platforms.
  • Figure 2: Left: the dual-arm FastUMI hardware data collection device developed for our dataset. Right: schematic diagram of multi-sensor temporal alignment at 20 Hz.
  • Figure 3: The pipeline of data collection and processing.
  • Figure 4: Dataset statistics of FastUMI-100K. Panel (a) shows objects being manipulated in real-world scenarios using different robots. Panel (b) breaks down the number of objects by scenario. Panel (c) shows the distribution of the six task types in the dataset. Panels (d) and (e) compare the average linear velocity and angular velocity of five types of dual-arm tasks selected from FastUMI-100K and AgiBot, demonstrating the flexibility and human-like qualities of FastUMI-100K when performing complex long-horizon tasks.
  • Figure 5: All 16 tasks evaluated in our experiment.
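
The multi-sensor temporal alignment at 20 Hz mentioned in the Figure 2 caption can be illustrated with a minimal sketch. The paper summary above does not describe the actual alignment procedure, so the nearest-timestamp matching used here is an assumption, and the names `nearest_index` and `align_streams` are hypothetical helpers, not part of the FastUMI codebase.

```python
from bisect import bisect_left


def nearest_index(timestamps, t):
    """Index of the sample in sorted `timestamps` that is closest to time t."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick the closer neighbour; ties go to the earlier sample.
    return i - 1 if t - timestamps[i - 1] <= timestamps[i] - t else i


def align_streams(streams, rate_hz=20.0):
    """Resample several sensor streams onto one shared time grid.

    streams: dict mapping a stream name to a sorted list of timestamps (seconds).
    Returns (grid, indices), where indices[name][k] is the index of the sample
    in that stream nearest to grid[k].
    NOTE: nearest-timestamp matching is an assumed strategy for illustration.
    """
    # Use only the interval covered by every stream.
    start = max(ts[0] for ts in streams.values())
    end = min(ts[-1] for ts in streams.values())
    n = int((end - start) * rate_hz + 1e-9) + 1
    grid = [start + k / rate_hz for k in range(n)]
    indices = {
        name: [nearest_index(ts, t) for t in grid]
        for name, ts in streams.items()
    }
    return grid, indices
```

Each aligned frame `k` then pairs, for example, the wrist image at `indices["camera"][k]` with the end-effector pose at `indices["pose"][k]`; the stream names are placeholders. A real pipeline might instead interpolate poses between samples, which this sketch does not attempt.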