Robotic Imitation of Human Actions
Josua Spisak, Matthias Kerzel, Stefan Wermter
TL;DR
This paper tackles cross-body imitation by enabling a robot to imitate a human demonstration from a different perspective using only a single example. It fuses diffusion-based action segmentation with an open-vocabulary object detector to extract temporal and spatial task structure, which is then refined by symbolic planning and executed via inverse kinematics. The approach demonstrates promising accuracy in action segmentation and reasonably robust 3D localization, but faces challenges in object classification and grasp reliability due to occlusions and detector noise. Overall, it shows that rapid, demonstration-driven imitation is feasible in cluttered environments, reducing the need for extensive training data in robotic skill transfer.
Abstract
Imitation allows us to quickly gain an understanding of a new task. Through a demonstration, we gain direct knowledge about which actions need to be performed and what goals they serve. In this paper, we introduce a new approach to imitation learning that tackles the challenges of a robot imitating a human, such as the change in perspective and body schema. Our approach uses a single human demonstration to abstract information about the demonstrated task, and uses that information to generalise and replicate it. We enable this through a new integration of two state-of-the-art methods: a diffusion action segmentation model to abstract temporal information from the demonstration and an open-vocabulary object detector for spatial information. Furthermore, we refine the abstracted information and use symbolic reasoning to create an action plan, which is executed using inverse kinematics so that the robot can imitate the demonstrated action.
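To make the fusion of temporal and spatial information more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the segmentation model yields one action label per frame and the detector yields named 3D object positions per frame. The names `Segment`, `Detection`, `segments_from_frame_labels`, and `build_plan`, the `"idle"` background label, and the rule "the manipulated object is the one that moved most within a segment" are all hypothetical choices made for illustration. Inverse kinematics is left out; each plan step only carries a target position that an IK solver could consume.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    label: str   # action label for this span of frames
    start: int   # first frame index (inclusive)
    end: int     # last frame index (inclusive)


@dataclass(frozen=True)
class Detection:
    frame: int   # frame index the detection belongs to
    name: str    # object name from the (assumed) open-vocabulary detector
    xyz: tuple   # 3D position, assumed to be in the robot's base frame


def segments_from_frame_labels(labels):
    """Collapse per-frame action labels into contiguous segments."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the list or on a label change.
        if i == len(labels) or labels[i] != labels[start]:
            segments.append(Segment(labels[start], start, i - 1))
            start = i
    return segments


def build_plan(segments, detections, background="idle"):
    """Pair each action segment with the object that moved most during it.

    Returns a list of (action, object_name, target_xyz) steps. This is an
    illustrative heuristic, not the symbolic planner described in the paper.
    """
    by_frame = {}
    for d in detections:
        by_frame.setdefault(d.frame, {})[d.name] = d.xyz

    plan = []
    for seg in segments:
        if seg.label == background:
            continue
        first = by_frame.get(seg.start, {})
        last = by_frame.get(seg.end, {})
        # Only objects seen at both segment boundaries can be compared.
        common = sorted(set(first) & set(last))
        if not common:
            continue
        moved = max(common, key=lambda n: math.dist(first[n], last[n]))
        plan.append((seg.label, moved, last[moved]))
    return plan


if __name__ == "__main__":
    labels = ["idle", "idle", "move", "move", "move", "idle"]
    detections = [
        Detection(2, "cup", (0.1, 0.0, 0.0)),
        Detection(2, "box", (0.5, 0.0, 0.0)),
        Detection(4, "cup", (0.4, 0.0, 0.0)),
        Detection(4, "box", (0.5, 0.0, 0.0)),
    ]
    print(build_plan(segments_from_frame_labels(labels), detections))
```

In this toy input the cup is the only object whose position changes during the `move` segment, so the resulting step targets the cup's final position. A real system would also need to handle occlusions and noisy detections, which the sketch ignores.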
