Robotic Imitation of Human Actions

Josua Spisak, Matthias Kerzel, Stefan Wermter

TL;DR

This paper tackles cross-body imitation by enabling a robot to imitate a human demonstration from a different perspective using only a single example. It fuses diffusion-based action segmentation with an open-vocabulary object detector to extract temporal and spatial task structure, which is then refined by symbolic planning and executed via inverse kinematics. The approach demonstrates promising accuracy in action segmentation and reasonably robust 3D localization, but faces challenges in object classification and grasp reliability due to occlusions and detector noise. Overall, it shows that rapid, demonstration-driven imitation is feasible in cluttered environments, reducing the need for extensive training data in robotic skill transfer.

Abstract

Imitation can allow us to quickly gain an understanding of a new task. Through a demonstration, we can gain direct knowledge about which actions need to be performed and which goals they have. In this paper, we introduce a new approach to imitation learning that tackles the challenges of a robot imitating a human, such as the change in perspective and body schema. Our approach can use a single human demonstration to abstract information about the demonstrated task, and use that information to generalise and replicate it. We facilitate this ability by a new integration of two state-of-the-art methods: a diffusion action segmentation model to abstract temporal information from the demonstration and an open vocabulary object detector for spatial information. Furthermore, we refine the abstracted information and use symbolic reasoning to create an action plan utilising inverse kinematics, to allow the robot to imitate the demonstrated action.

Paper Structure (9 sections, 4 figures, 1 table)

Figures (4)

  • Figure 1: The top three images are taken from one of the human demonstrations at three timesteps: when the object is first grasped, during the movement, and after it is released. In the bottom three images, we can see three timesteps taken from the imitation of said demonstration. Again we show the moment when the object is first grasped, being moved, and after it is released. While the demonstration was given by a human sitting across from the robot, the model is able to perform the action from the robot's perspective.
  • Figure 2: Our robot in its natural environment during one of the imitations where it grasps the spam can.
  • Figure 3: An overview of our architecture. The approach starts on the left with the human demonstration, which we record as a video using the cameras in our robot's head. This video is passed through two models: our diffusion action segmentation model and the ViLD object detector. The object detector provides 3D positions for each detected object in each frame. The action segmentation model provides a class label for each frame: either moving with an object or moving without one. This information is used in our logic programming algorithm to create an action plan. The action plan is passed to our robot's inverse kinematics, which moves the robot to imitate the demonstration.
  • Figure 4: The results from the action segmentation on three demonstrations. For each demonstration there is one bar showing the ground truth and one bar showing the results from the action segmentation. For each of them, the time goes from left to right. The two possible classes are shown with different colours: red is for movement without an object, and green is for movement with an object.
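
The Figure 3 caption describes combining per-frame action labels with per-frame 3D object positions to form an action plan. The following is a minimal sketch of one way such a combination could work; it is not the authors' logic programming algorithm. The label names, the `Segment` type, the `pick`/`place` plan format, and the "object that moved furthest" heuristic are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Dict, List, Tuple

Position = Tuple[float, float, float]

# Hypothetical names for the two frame classes mentioned in the captions.
WITH_OBJECT = "move_with_object"
WITHOUT_OBJECT = "move_without_object"


@dataclass
class Segment:
    label: str
    start: int  # first frame index of the segment (inclusive)
    end: int    # last frame index of the segment (inclusive)


def frames_to_segments(labels: List[str]) -> List[Segment]:
    """Collapse a per-frame label sequence into contiguous segments."""
    segments: List[Segment] = []
    index = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append(Segment(label, index, index + length - 1))
        index += length
    return segments


def distance(a: Position, b: Position) -> float:
    """Euclidean distance between two 3D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def build_plan(
    labels: List[str],
    positions: List[Dict[str, Position]],
) -> List[Tuple[str, str, Position]]:
    """Derive pick/place steps from frame labels and object positions.

    positions[i] maps an object name to its 3D position in frame i.
    Assumption: the manipulated object is the one displaced the most
    between the first and last frame of a "with object" segment.
    """
    plan: List[Tuple[str, str, Position]] = []
    for seg in frames_to_segments(labels):
        if seg.label != WITH_OBJECT:
            continue
        before, after = positions[seg.start], positions[seg.end]
        # Only consider objects detected at both ends of the segment.
        candidates = sorted(set(before) & set(after))
        if not candidates:
            continue
        moved = max(candidates, key=lambda n: distance(before[n], after[n]))
        plan.append(("pick", moved, before[moved]))
        plan.append(("place", moved, after[moved]))
    return plan
```

In this sketch, each resulting `pick` or `place` target would be handed to an inverse kinematics solver to produce joint motions. A real system would also need to handle detector noise and occlusions, for example by smoothing positions over several frames rather than reading single frames at segment boundaries.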