How to Train Your Robots? The Impact of Demonstration Modality on Imitation Learning
Haozhuo Li, Yuchen Cui, Dorsa Sadigh
TL;DR
This paper investigates how demonstration modality affects imitation learning for robot manipulation by comparing kinesthetic teaching, VR teleoperation, and spacemouse teleoperation across three tasks using a diffusion-based imitation policy. It finds that kinesthetic demonstrations yield the strongest policy performance and are rated the most intuitive by users, but are burdensome for large-scale data collection, while teleoperation provides broader state coverage. A simple hybrid data collection strategy that blends a small number of kinesthetic demonstrations with teleoperation data achieves about 20% higher success rates on average than single-modality data. The findings offer practical guidance for scalable imitation-learning data collection and suggest avenues for optimizing the modality mix in real-world deployments.
Abstract
Imitation learning is a promising approach for learning robot policies from user-provided data. The way demonstrations are provided, i.e., the demonstration modality, influences the quality of the data. While existing research shows that users prefer kinesthetic teaching (physically guiding the robot) for its intuitiveness and ease of use, the majority of existing manipulation datasets were collected through teleoperation via a VR controller or spacemouse. In this work, we investigate how different demonstration modalities impact downstream learning performance as well as user experience. Specifically, we compare three low-cost demonstration modalities: kinesthetic teaching, teleoperation with a VR controller, and teleoperation with a spacemouse controller. We experiment with three table-top manipulation tasks with different motion constraints. We evaluate and compare imitation learning performance using data from the different demonstration modalities, and collect subjective feedback on user experience. Our results show that kinesthetic teaching is rated the most intuitive for controlling the robot and provides the cleanest data, leading to the best downstream learning performance. However, it is not the preferred method for large-scale data collection because of the physical load it places on the demonstrator. Based on this insight, we propose a simple data collection scheme that mixes a small number of kinesthetic demonstrations with data collected through teleoperation to achieve the best overall learning performance while keeping data-collection effort low.
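
The mixed data collection scheme described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the function name `mix_demonstrations`, the representation of a demonstration as a dictionary, the `n_kinesthetic` parameter, and the choice to sample the kinesthetic subset uniformly at random are all hypothetical.

```python
import random
from typing import Dict, List, Sequence


def mix_demonstrations(
    kinesthetic: Sequence[Dict],
    teleop: Sequence[Dict],
    n_kinesthetic: int,
    seed: int = 0,
) -> List[Dict]:
    """Combine a small kinesthetic subset with all teleoperation demos.

    Hypothetical helper: the subset size and uniform sampling are
    illustrative assumptions, not values taken from the paper.
    """
    if not 0 <= n_kinesthetic <= len(kinesthetic):
        raise ValueError("n_kinesthetic must be between 0 and len(kinesthetic)")

    rng = random.Random(seed)
    # Pick a small number of kinesthetic demonstrations without replacement.
    chosen = rng.sample(list(kinesthetic), n_kinesthetic)

    # Tag every demonstration with its modality so it can be tracked later.
    mixed = [{**demo, "modality": "kinesthetic"} for demo in chosen]
    mixed += [{**demo, "modality": "teleop"} for demo in teleop]

    # Shuffle so training batches see both modalities interleaved.
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    kin = [{"id": f"k{i}"} for i in range(10)]
    tel = [{"id": f"t{i}"} for i in range(40)]
    dataset = mix_demonstrations(kin, tel, n_kinesthetic=5)
    print(len(dataset))
```

In this sketch the teleoperation data is kept whole and only the kinesthetic portion is subsampled, reflecting the idea that kinesthetic demonstrations are the physically costly ones to collect. The appropriate number of kinesthetic demonstrations is task dependent and is not specified here.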
