How to Train Your Robots? The Impact of Demonstration Modality on Imitation Learning

Haozhuo Li, Yuchen Cui, Dorsa Sadigh

TL;DR

This paper investigates how demonstration modality affects imitation learning for robot manipulation by comparing kinesthetic teaching, VR teleoperation, and spacemouse teleoperation across three tasks using a diffusion-based imitation policy. It finds that kinesthetic demonstrations yield the strongest policy performance and are most intuitive, but are burdensome for large-scale data collection, while teleoperation provides broader state coverage. A simple hybrid data collection strategy that blends kinesthetic with teleoperation data achieves about 20% higher success rates on average than single-modality data. The findings offer practical guidance for scalable imitation-learning data collection and suggest avenues for optimizing modality mix in real-world deployments.

Abstract

Imitation learning is a promising approach for learning robot policies with user-provided data. The way demonstrations are provided, i.e., demonstration modality, influences the quality of the data. While existing research shows that kinesthetic teaching (physically guiding the robot) is preferred by users for its intuitiveness and ease of use, the majority of existing manipulation datasets were collected through teleoperation via a VR controller or spacemouse. In this work, we investigate how different demonstration modalities impact downstream learning performance as well as user experience. Specifically, we compare low-cost demonstration modalities including kinesthetic teaching, teleoperation with a VR controller, and teleoperation with a spacemouse controller. We experiment with three table-top manipulation tasks with different motion constraints. We evaluate and compare imitation learning performance using data from different demonstration modalities, and collect subjective feedback on user experience. Our results show that kinesthetic teaching is rated the most intuitive for controlling the robot and provides the cleanest data, yielding the best downstream learning performance. However, it is not preferred for large-scale data collection because of the physical load. Based on this insight, we propose a simple data collection scheme that relies on a small number of kinesthetic demonstrations mixed with data collected through teleoperation to achieve the best overall learning performance while maintaining low data-collection effort.
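
The mixing scheme described in the abstract can be illustrated with a short sketch. This is not the authors' implementation: the function name `build_hybrid_dataset`, the demonstration record layout, and the counts used in the usage example are all hypothetical, chosen only to show how a small kinesthetic subset might be combined with teleoperation data before training.

```python
import random
from typing import Any, Dict, List

# Hypothetical demonstration record, e.g. {"modality": "kinesthetic", "id": 0}.
Demo = Dict[str, Any]


def build_hybrid_dataset(
    kinesthetic: List[Demo],
    teleop: List[Demo],
    num_kinesthetic: int,
    num_total: int,
    seed: int = 0,
) -> List[Demo]:
    """Mix a small number of kinesthetic demos with teleoperation demos.

    The split between the two sources is an assumption for illustration;
    the abstract does not state the counts used in the paper.
    """
    num_teleop = num_total - num_kinesthetic
    if num_kinesthetic < 0 or num_teleop < 0:
        raise ValueError("num_kinesthetic must lie between 0 and num_total")
    if num_kinesthetic > len(kinesthetic) or num_teleop > len(teleop):
        raise ValueError("not enough demonstrations to satisfy the request")

    rng = random.Random(seed)  # seeded so the mix is reproducible
    mixed = rng.sample(kinesthetic, num_kinesthetic) + rng.sample(teleop, num_teleop)
    rng.shuffle(mixed)  # interleave modalities so training batches are mixed
    return mixed


# Usage with placeholder records (counts are illustrative only).
kinesthetic_demos = [{"modality": "kinesthetic", "id": i} for i in range(5)]
teleop_demos = [{"modality": "vr", "id": i} for i in range(20)]
hybrid = build_hybrid_dataset(kinesthetic_demos, teleop_demos, num_kinesthetic=2, num_total=10)
```

Sampling a fixed count rather than a fraction keeps the physically demanding kinesthetic effort explicit, which matches the motivation given above for using only a few such demonstrations.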

Paper Structure

This paper contains 6 sections, 6 equations, 11 figures.

Figures (11)

  • Figure 1: Demonstration modalities under study. Kinesthetic teaching controls precise joint poses, whereas teleoperation controls the delta pose of the robot's end-effector: VR provides a direct spatial mapping of the trajectory, while spacemouse allows the user to command velocity through buttons.
  • Figure 2: Popular demonstration modalities. Composition of human demonstration modalities present in the OpenXE dataset (open_x_embodiment_rt_x_2023).
  • Figure 3: Action discrepancy. Replaying the recorded end-effector pose may not recover the desired action, especially when contact force is present.
  • Figure 4: Tasks with varying motion constraints. The three selected tasks each have a different type of motion constraint (e.g., constrained linear motion, large rotation, and exerting contact force) to represent a broad range of tasks.
  • Figure 5: Policy performance. Success rates of policies learned for each task under different demonstration modalities. Kinesthetic data leads to the best-performing models in Open Drawer and Flip Glass but underperforms in Push Sanitizer, where strong contact force is required to complete the task.
  • ...and 6 more figures