MoVEInt: Mixture of Variational Experts for Learning Human-Robot Interactions from Demonstrations

Vignesh Prasad, Alap Kshirsagar, Dorothea Koert, Ruth Stock-Homburg, Jan Peters, Georgia Chalvatzaki

TL;DR

Regularizing a VAE with an informative MDN prior learned from human observations generates more accurate robot motions than previous HMM-based or recurrent approaches to learning shared latent representations, as validated on HRI datasets involving interactions such as handshakes, fistbumps, waving, and handovers.

Abstract

Shared dynamics models are important for capturing the complexity and variability inherent in Human-Robot Interaction (HRI). Therefore, learning such shared dynamics models can enhance coordination and adaptability to enable successful reactive interactions with a human partner. In this work, we propose a novel approach for learning a shared latent space representation for HRIs from demonstrations in a Mixture of Experts fashion for reactively generating robot actions from human observations. We train a Variational Autoencoder (VAE) to learn robot motions regularized using an informative latent space prior that captures the multimodality of the human observations via a Mixture Density Network (MDN). We show how our formulation derives from a Gaussian Mixture Regression formulation that is typically used in approaches for learning HRI from demonstrations, such as using an HMM/GMM for learning a joint distribution over the actions of the human and the robot. We further incorporate an additional regularization to prevent "mode collapse", a common phenomenon when using latent space mixture models with VAEs. We find that our approach of using an informative MDN prior from human observations for a VAE generates more accurate robot motions compared to previous HMM-based or recurrent approaches of learning shared latent representations, which we validate on various HRI datasets involving interactions such as handshakes, fistbumps, waving, and handovers. Further experiments in a real-world human-to-robot handover scenario show the efficacy of our approach for generating successful interactions with four different human interaction partners.
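
For orientation, the Gaussian Mixture Regression that the abstract refers to can be written in its standard textbook form. This is a sketch under assumptions, not the paper's own derivation: the symbols $\pi_i$, $\boldsymbol{\mu}_i$, and $\boldsymbol{\Sigma}_i$ are introduced here for illustration, with superscripts $h$ and $r$ marking the human and robot blocks of a joint GMM, and $\alpha_i(\boldsymbol{x}^h_t)$ following the notation of the Figure 4 caption.

```latex
% Joint GMM over human and robot variables (standard form, symbols assumed)
p(\boldsymbol{x}^h_t, \boldsymbol{x}^r_t) = \sum_{i=1}^{K} \pi_i \,
  \mathcal{N}\!\left(
    \begin{bmatrix} \boldsymbol{x}^h_t \\ \boldsymbol{x}^r_t \end{bmatrix}
    \,\middle|\,
    \begin{bmatrix} \boldsymbol{\mu}^h_i \\ \boldsymbol{\mu}^r_i \end{bmatrix},
    \begin{bmatrix}
      \boldsymbol{\Sigma}^{hh}_i & \boldsymbol{\Sigma}^{hr}_i \\
      \boldsymbol{\Sigma}^{rh}_i & \boldsymbol{\Sigma}^{rr}_i
    \end{bmatrix}
  \right)

% Responsibility of component i given only the human observation
\alpha_i(\boldsymbol{x}^h_t) =
  \frac{\pi_i \, \mathcal{N}(\boldsymbol{x}^h_t \mid \boldsymbol{\mu}^h_i, \boldsymbol{\Sigma}^{hh}_i)}
       {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\boldsymbol{x}^h_t \mid \boldsymbol{\mu}^h_j, \boldsymbol{\Sigma}^{hh}_j)}

% Conditional mean of the robot variables: a weighted sum of per-component linear predictions
\mathbb{E}[\boldsymbol{x}^r_t \mid \boldsymbol{x}^h_t] =
  \sum_{i=1}^{K} \alpha_i(\boldsymbol{x}^h_t)
  \left( \boldsymbol{\mu}^r_i
       + \boldsymbol{\Sigma}^{rh}_i (\boldsymbol{\Sigma}^{hh}_i)^{-1}
         (\boldsymbol{x}^h_t - \boldsymbol{\mu}^h_i) \right)
```

Read this way, an MDN can be seen as replacing the closed-form responsibilities and per-component predictions with learned functions of the human observation, which is presumably the sense in which the formulation "derives from" GMR.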

Paper Structure (17 sections, 9 equations, 4 figures, 3 tables, 1 algorithm)


Figures (4)

  • Figure 1: Target poses generated in a reactive manner by the mixture of policies learned by MoVEInt for a Handshake interaction with the humanoid robot Pepper. MoVEInt generates multiple policies (shown in green, magenta, and orange) based on human observations which are then combined to generate a suitable response motion for the robot.
  • Figure 2: Overview of our approach "MoVEInt". We train a reactive policy using a Mixture Density Network (MDN) to predict latent space robot actions from human observations. The MDN policy is used not just for reactively generating the robot's actions, but also to regularize a VAE that learns a latent representation of the robot's actions. This regularization ensures that the learned robot representation matches the predicted MDN policy and also ensures that the robot VAE learns to decode samples from the MDN policy.
  • Figure 3: Sample Human-Robot Interactions produced by the reactive motions that MoVEInt generates for a Bimanual Handover scenario.
  • Figure 4: Sample trajectories generated by MoVEInt for the Bimanual and Unimanual Handovers in the HHI-Handovers dataset [kshirsagar2023dataset]. The 3D plots show the reconstructed trajectories, and the 2D plots show the corresponding progression of $\alpha_i(\textcolor{red}{\boldsymbol{x}^h_t})$ for the different components of the MDN. In the 3D plots, the observed trajectory of the receiver is shown in red, the generated trajectory of the giver in blue, and the giver's corresponding ground truth in black. The reconstructions of the individual latent components of the MDN are shown in green, magenta, and orange. The learned components correspond to different parts of the task space. For example, green denotes the hand locations for a unimanual handover, magenta denotes the hand locations for a bimanual handover, and orange denotes the static hand locations for the starting and ending neutral poses. In the 2D plots, the coefficients for the bimanual (magenta) and unimanual (green) components are activated according to the interaction being performed, while the coefficient for the neutral-pose component (orange) is activated at the beginning of the interaction, when both partners are static.
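
The captions for Figures 1 and 4 describe several component policies being weighted by coefficients $\alpha_i(\boldsymbol{x}^h_t)$ and combined into one robot response. A minimal NumPy sketch of that idea follows. It assumes the combination is a convex (softmax-weighted) sum of component means in latent space, as in the conditional mean of Gaussian Mixture Regression; the paper's actual combination rule, network architecture, and decoder are not given in this section. The function names `softmax` and `mix_latent_policies` and all numeric values are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def mix_latent_policies(alpha_logits, component_means):
    """Blend K latent component means into a single latent action.

    alpha_logits:    shape (K,), unnormalized mixture scores that an MDN
                     would predict from the current human observation.
    component_means: shape (K, D), latent mean of each component policy.

    Assumption: the combined action is the alpha-weighted mean of the
    components; this is an illustrative choice, not the paper's stated rule.
    """
    alpha = softmax(alpha_logits)   # mixture coefficients, sum to 1
    z = alpha @ component_means     # convex combination, shape (D,)
    return alpha, z


# Toy values standing in for MDN outputs at one time step (hypothetical).
alpha_logits = np.array([2.0, 0.0, 0.0])
component_means = np.array([
    [1.0, 0.0],    # e.g. a "unimanual" component
    [0.0, 1.0],    # e.g. a "bimanual" component
    [-1.0, -1.0],  # e.g. a "neutral pose" component
])

alpha, z = mix_latent_policies(alpha_logits, component_means)
print("alpha:", alpha)
print("latent action:", z)
```

In a full pipeline the blended latent action `z` would be passed through the VAE decoder to obtain a robot pose; that step is omitted here because the decoder is not specified in this section.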