ARCap: Collecting High-quality Human Demonstrations for Robot Learning with Augmented Reality Feedback

Sirui Chen, Chen Wang, Kaden Nguyen, Li Fei-Fei, C. Karen Liu

TL;DR

ARCap tackles the scalability challenge of imitation learning data by providing real-time AR feedback that visualizes and retargets human motion to diverse robot embodiments while performing collision checks. The system supports cross-embodiment data collection and uses a diffusion-based imitation-learning pipeline trained on ARCap data, demonstrated through cluttered-object manipulation and long-horizon tasks. User studies and real-robot experiments show ARCap improves data quality, reduces collision and kinematic violations, and enables successful policies across different end-effectors. The work offers an open-source, portable solution that broadens access to robot learning, with potential extensions to mobile humanoids and guided data collection via language models.

Abstract

Recent progress in imitation learning from human demonstrations has shown promising results in teaching robots manipulation skills. To further scale up training datasets, recent work has begun to use portable data collection devices that do not require physical robot hardware. However, because there is no on-robot feedback during data collection, data quality depends heavily on user expertise, and many devices are limited to specific robot embodiments. We propose ARCap, a portable data collection system that provides visual feedback through augmented reality (AR) and haptic warnings to guide users in collecting high-quality demonstrations. Through extensive user studies, we show that ARCap enables novice users to collect robot-executable data that matches robot kinematics and avoids collisions with the scene. With data collected from ARCap, robots can perform challenging tasks, such as manipulation in cluttered environments and long-horizon cross-embodiment manipulation. ARCap is fully open-source and easy to calibrate; all components are built from off-the-shelf products. More details and results can be found on our website: https://stanford-tml.github.io/ARCap

Paper Structure

This paper contains 15 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: ARCap System Overview. (a) Collect human hand motion data. (b) Provide real-time AR feedback, visualizing a virtual robot retargeted to the human hand in the AR display. (c) Roll out robot policies trained with the collected data.
  • Figure 2: Visualization of AR Feedback. (a) Normal data recording: the red frame indicates the visible region of the RGB-D camera. (b) Collision warning: when the virtual robot collides with the environment, the controller on the user's glove vibrates, and the frame blinks blue. (c) Fast motion warning: when the user moves faster than the robot's speed limits, the frame turns yellow. (d) Users can check whether target objects are within the camera's view during data collection.
  • Figure 3: Cross-Embodiment Data Collection. (a) ARCap can collect data for parallel-jaw grippers by guiding the user to form their hands into a gripper-like shape. If the user changes the hand gesture, the retargeting error will be large. (b) For a multi-finger dexterous hand, ARCap retargets the robot's fingertips to match the human fingertips, with the robot's wrist orientation determined by the orientation of the controller mounted on the user's gloves.
  • Figure 4: ARCap System Layout. The user wears an AR headset and motion capture gloves, with controllers mounted on the gloves for tracking the 6D pose of the palms. Data is stored on a laptop carried in a backpack.
  • Figure 5: AR-based Camera Calibration. When calibrating the camera, users align the virtual robot's base with the actual robot's base.
  • ...and 3 more figures
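
The feedback states described in the Figure 2 caption can be read as a small decision rule. The sketch below is only an illustration of that rule, not ARCap's actual code: the names `FrameState`, `Feedback`, `hand_speed`, and `select_feedback` are hypothetical, the speed estimate from two consecutive palm positions is an assumption, and so is the choice to let a collision warning take priority over a fast-motion warning and to vibrate only on collision.

```python
import math
from dataclasses import dataclass
from enum import Enum


class FrameState(Enum):
    """Frame appearance per the Figure 2 caption (names are hypothetical)."""
    RECORDING = "red"            # (a) normal data recording
    COLLISION = "blinking blue"  # (b) virtual robot collides with the scene
    TOO_FAST = "yellow"          # (c) user exceeds the robot's speed limit


@dataclass(frozen=True)
class Feedback:
    frame: FrameState
    vibrate: bool  # haptic warning on the glove-mounted controller


def hand_speed(prev_pos, curr_pos, dt):
    """Estimate palm speed (m/s) from two consecutive 3D positions.

    Assumption: a finite-difference estimate; the source does not say
    how speed is measured.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    return math.dist(prev_pos, curr_pos) / dt


def select_feedback(in_collision, speed, speed_limit):
    """Pick the feedback state for one frame.

    Assumption: collision takes priority over the fast-motion warning,
    and only the collision warning triggers vibration.
    """
    if in_collision:
        return Feedback(FrameState.COLLISION, vibrate=True)
    if speed > speed_limit:
        return Feedback(FrameState.TOO_FAST, vibrate=False)
    return Feedback(FrameState.RECORDING, vibrate=False)


if __name__ == "__main__":
    # Hypothetical values: palm moved 0.5 m in 0.5 s against a 0.8 m/s limit.
    v = hand_speed((0.0, 0.0, 0.0), (0.3, 0.4, 0.0), dt=0.5)
    print(select_feedback(in_collision=False, speed=v, speed_limit=0.8))
```

In a real system the `in_collision` flag would come from a collision check between the retargeted virtual robot and the scene, which this sketch leaves out.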