
RoboPaint: From Human Demonstration to Any Robot and Any View

Jiacheng Fan, Zhiyue Zhao, Yiqian Zhang, Chao Chen, Peide Wang, Hengdi Zhang, Zhengxue Cheng

TL;DR

RoboPaint tackles the data bottleneck for scalable dexterous manipulation by turning human demonstrations into robot-valid data through a Real-Sim-Real pipeline, without direct teleoperation. It combines multimodal human data collection, tactile-aware Dex-Tactile retargeting in 3D space, and photorealistic scene rendering in Isaac Sim to produce large-scale robot training data across embodiments and viewpoints. In real-world experiments, retargeted dex-hand replay reaches an average success rate of $84\%$ across 10 object manipulation tasks, and VLA policies trained only on the painted data reach $80\%$ average success across pick-and-place, pushing, and pouring. The pipeline offers a scalable, cost-effective alternative to teleoperation for building high-fidelity, cross-embodiment datasets, enabling broader deployment of VLA agents in complex dexterous tasks.

Abstract

Acquiring large-scale, high-fidelity robot demonstration data remains a critical bottleneck for scaling Vision-Language-Action (VLA) models in dexterous manipulation. We propose a Real-Sim-Real data collection and data editing pipeline that transforms human demonstrations into robot-executable, environment-specific training data without direct robot teleoperation. Standardized data collection rooms are built to capture synchronized multimodal human demonstrations (3 RGB-D videos, 11 RGB videos, 29-DoF glove joint angles, and 14-channel tactile signals). Based on these human demonstrations, we introduce a tactile-aware retargeting method that maps human hand states to robot dex-hand states via geometry- and force-guided optimization. The retargeted robot trajectories are then rendered in a photorealistic Isaac Sim environment to build robot training data. Real-world experiments demonstrate that (1) the retargeted dex-hand trajectories achieve an 84% success rate across 10 diverse object manipulation tasks, and (2) VLA policies (Pi0.5) trained exclusively on our generated data achieve an 80% average success rate on three representative tasks: pick-and-place, pushing, and pouring. In conclusion, robot training data can be efficiently "painted" from human demonstrations using our Real-Sim-Real data pipeline. We offer a scalable, cost-effective alternative to teleoperation with minimal performance loss for complex dexterous manipulation.
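To make the capture setup concrete, the following is a minimal sketch of how one synchronized time step of a human demonstration might be represented. Only the stream counts (3 RGB-D cameras, 11 RGB cameras, 29 glove joint angles, 14 tactile channels) come from the abstract. The class name `DemoFrame`, its field names, the choice to store camera frames as file paths, and the helper `make_example_frame` are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence

# Stream counts stated in the abstract.
NUM_RGBD_CAMERAS = 3
NUM_RGB_CAMERAS = 11
GLOVE_DOF = 29
TACTILE_CHANNELS = 14


@dataclass(frozen=True)
class DemoFrame:
    """One synchronized time step of a human demonstration (hypothetical layout)."""

    timestamp: float                # seconds since the start of the recording (assumed unit)
    rgbd_frames: Sequence[str]      # one image path per RGB-D camera (assumed storage format)
    rgb_frames: Sequence[str]       # one image path per RGB camera (assumed storage format)
    glove_joints: Sequence[float]   # glove joint angles, one value per degree of freedom
    tactile: Sequence[float]        # tactile readings, one value per channel

    def validate(self) -> None:
        """Raise ValueError if any stream has the wrong number of entries."""
        expected = {
            "rgbd_frames": NUM_RGBD_CAMERAS,
            "rgb_frames": NUM_RGB_CAMERAS,
            "glove_joints": GLOVE_DOF,
            "tactile": TACTILE_CHANNELS,
        }
        for name, count in expected.items():
            actual = len(getattr(self, name))
            if actual != count:
                raise ValueError(f"{name}: expected {count} entries, got {actual}")


def make_example_frame(timestamp: float = 0.0) -> DemoFrame:
    """Build a placeholder frame with zeroed signals and made-up file names."""
    return DemoFrame(
        timestamp=timestamp,
        rgbd_frames=[f"rgbd_{i}.png" for i in range(NUM_RGBD_CAMERAS)],
        rgb_frames=[f"rgb_{i}.png" for i in range(NUM_RGB_CAMERAS)],
        glove_joints=[0.0] * GLOVE_DOF,
        tactile=[0.0] * TACTILE_CHANNELS,
    )
```

Checking stream counts at load time is one simple way to catch a dropped camera or a truncated glove packet before such frames reach retargeting; the actual on-disk format used by the authors may differ.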

Paper Structure (21 sections, 12 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: RoboPaint data pipeline. Our data pipeline paints robot demonstrations from multimodal data collected from human demonstrations. The cross-embodiment gap between human and robot is resolved by our Dex-Tactile retargeting method.
  • Figure 2: System Overview of the Real-Sim-Real Pipeline. Our framework begins with high-precision human data collection using instrumented gloves in our multiview Data-Acquisition-Room. The raw data is then processed through object pose estimation and Dex-Tactile joint retargeting to obtain the motion of the target robot embodiment and the object. The deployment scene is then reconstructed via 3D Gaussian Splatting and imported into the simulation environment. Finally, we drive the robot and objects accordingly and record robot demonstrations from arbitrary views.
  • Figure 3: Simulation validation of our data collection pipeline and Real-Sim-Real data processing pipeline. The first row shows the raw images of the collected data. The second row shows the re-projection results for the glove, object, and tactile points. The last row shows the images rendered using our Real-Sim-Real data pipeline.
  • Figure 4: Contact error in the simulation environment and success rates of real-world replay for different objects. Each object was tested on 10 demonstrations using retargeted DexH13 joint angles and retargeted dex-hand end trajectories.
  • Figure 5: Data augmentation examples showing background replacement and object material changes. Top row shows the original scene, middle row demonstrates background change, and bottom row illustrates object material variation.
  • ...and 4 more figures