AR2-D2: Training a Robot Without a Robot

Jiafei Duan, Yi Ru Wang, Mohit Shridhar, Dieter Fox, Ranjay Krishna

TL;DR

AR2-D2 introduces a mobile AR-based framework for collecting robot demonstrations without real robots or user training. By projecting an AR robot into the scene and using depth data, users can demonstrate object manipulation, and these demonstrations are converted into training data for imitation-learning policies. The approach enables training real-robot manipulators on personalized objects with as few as five demonstrations and brief fine-tuning, achieving performance on par with demonstrations collected on real robots. User studies show the interface is intuitive and faster than traditional methods, suggesting AR2-D2 can democratize robot training beyond lab environments.

Abstract

Diligently gathered human demonstrations serve as the unsung heroes empowering the progression of robot learning. Today, demonstrations are collected by training people to use specialized controllers, which (tele-)operate robots to manipulate a small number of objects. By contrast, we introduce AR2-D2: a system for collecting demonstrations which (1) does not require people with specialized training, (2) does not require any real robots during data collection, and therefore, (3) enables manipulation of diverse objects with a real robot. AR2-D2 is a framework in the form of an iOS app that people can use to record a video of themselves manipulating any object while simultaneously capturing essential data modalities for training a real robot. We show that data collected via our system enables the training of behavior cloning agents in manipulating real objects. Our experiments further show that training with our AR data is as effective as training with real-world robot demonstrations. Moreover, our user study indicates that users find AR2-D2 intuitive to use and require no training in contrast to four other frequently employed methods for collecting robot demonstrations.

Paper Structure (10 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: AR2-D2 collects robot demonstrations without needing a real robot. (Left) Using AR2-D2, the user captures a video manipulating an object with their arm. AR2-D2 projects an operational URDF of an AR Franka Panda robot arm into a physical environment. It uses a hand-pose tracking algorithm to move the AR robot's end effector to align with and mirror the 6D pose of the human hand. (Middle) With this video demonstration, we train a Perceiver-Actor agent and (Right) deploy the agent on a real-world robot to demonstrate its ability to learn from AR demonstrations.
  • Figure 2: AR2-D2 collection process. (Left) Once the user records themselves manipulating an object, AR2-D2 extracts the following information: 6D hand pose, hand state, RGB frames, and depth estimations. We replace the hand with an AR robot, moving it so that its end effector aligns with the hand's pose. (Right) We create a 3D voxelized representation over time from the extracted information. This 3D representation is used to train a PerAct [shridhar2022perceiver] agent. We also use the generated video to train an image-conditioned BC agent [shridhar2022perceiver].
  • Figure 3: Evaluating AR2-D2 with real users. We conduct an extensive within-subjects user study, comparing AR2-D2 against four alternative collection techniques: keyboard & mouse, 3D mouse (6-DoF), kinesthetic teaching, and HTC Vive controller. (Left) The first two techniques control a simulated Franka Panda, while the next two control a real robot; AR2-D2 manipulates an AR robot in the real world. Participants used these techniques to collect demonstrations for two tasks: (1) pick up and move a cube to a designated location and (2) stack three cubes. (Right) (a, b) We find that participants spend significantly less time using our system (an average of 22.1 and 29.5 seconds across the two tasks) than with the next-best technique (kinesthetic teaching, with an average of 41.6 and 61.4 seconds). (c, d) We show that participants successfully collect demonstrations at the same rate with our system as with kinesthetic teaching, and both have a significantly higher success rate than the other techniques.
  • Figure 4: Evaluating AR2-D2 data by training a real robot to manipulate real objects. We use AR2-D2 to gather a diverse array of manipulations covering three fundamental actions and a wide variety of customized objects. These manipulations range from precise actions, such as pressing a computer mouse or a Minecraft torch button at specific locations, to pushing small LEGO train toys towards table-sized drawers, and picking up objects varying from chess pieces to takeaway bags. By leveraging a limited number of real-world action demonstrations conducted with random dummy objects and fine-tuning for 3,000 iterations, which is equivalent to 10 minutes of training, we are able to apply the PerAct framework to manipulate all these personalized objects with broad generalization.
  • Figure 5: Analysis on Fine-tuning. We conducted a diagnostic analysis to determine the optimal number of iterations and demonstrations required. By varying the number of demonstrations and iterations for fine-tuning, we found that using 5 demonstrations and 3,000 iterations yielded the best results.
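
The Figure 2 caption mentions building a 3D voxelized representation from RGB frames and depth estimations. The following is a minimal sketch of that general idea, not the authors' implementation: it back-projects a depth image through an assumed pinhole camera model and averages colors per voxel. The function names, the intrinsics (fx, fy, cx, cy), the workspace bounds, and the grid size are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def back_project(depth, rgb, fx, fy, cx, cy):
    """Turn a depth image (rows of metres) into (point, color) pairs.

    Assumes a pinhole camera with hypothetical intrinsics fx, fy, cx, cy.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    samples = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:
                point = ((u - cx) * z / fx, (v - cy) * z / fy, z)
                samples.append((point, rgb[v][u]))
    return samples


def voxelize(samples, lo, hi, grid_size):
    """Return a sparse grid {(i, j, k): mean RGB} for points inside [lo, hi)."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])  # R, G, B sums and count
    for point, color in samples:
        if not all(l <= x < h for x, l, h in zip(point, lo, hi)):
            continue  # outside the workspace box
        idx = tuple(
            min(int((x - l) / (h - l) * grid_size), grid_size - 1)
            for x, l, h in zip(point, lo, hi)
        )
        acc = sums[idx]
        for ch in range(3):
            acc[ch] += color[ch]
        acc[3] += 1
    return {idx: tuple(s / acc[3] for s in acc[:3]) for idx, acc in sums.items()}


# Toy 2x2 frame: three valid depth readings and one invalid pixel (0.0).
depth = [[1.0, 1.0], [1.0, 0.0]]
rgb = [[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], [(0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]]
samples = back_project(depth, rgb, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
grid = voxelize(samples, lo=(-1.0, -1.0, 0.0), hi=(1.0, 1.0, 2.0), grid_size=2)
```

A sparse dictionary is used here only to keep the sketch dependency-free; a dense array over the workspace would be the more usual input format for a voxel-based policy.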