
ImitationNet: Unsupervised Human-to-Robot Motion Retargeting via Shared Latent Space

Yashuai Yan, Esteve Valls Mascaro, Dongheui Lee

TL;DR

ImitationNet introduces an unsupervised framework for human-to-robot motion retargeting by learning a shared latent space for human poses and robot joints via adaptive contrastive learning and a global-rotation similarity metric. Encoders map human and robot poses to a common latent representation, and a decoder translates latent codes into robot joint commands, enabling direct control and latent-space interpolation for in-between motions. The method eliminates the need for paired data, achieves real-time performance on a Tiago++ robot, and supports multiple input modalities including text and RGB video. Results show improved retargeting precision and efficiency over a supervised baseline and demonstrate scalable generalization to new robots and modalities.
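
The encode, interpolate, decode flow described above can be illustrated with a minimal sketch. The names `lerp`, `interpolate_motion`, `toy_encoder`, and `toy_decoder` are hypothetical, and the toy linear maps merely stand in for the trained human encoder and robot decoder; this is not the authors' implementation.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def lerp(z_a: Sequence[float], z_b: Sequence[float], t: float) -> Vector:
    """Linearly interpolate between two latent codes for t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def interpolate_motion(
    x_h_start: Sequence[float],
    x_h_end: Sequence[float],
    encode_human: Callable[[Sequence[float]], Vector],
    decode_robot: Callable[[Sequence[float]], Vector],
    steps: int,
) -> List[Vector]:
    """Encode two human key poses, blend them in latent space, decode each blend."""
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both key poses")
    z_start = encode_human(x_h_start)
    z_end = encode_human(x_h_end)
    return [
        decode_robot(lerp(z_start, z_end, i / (steps - 1)))
        for i in range(steps)
    ]


# Toy stand-ins (assumptions): a real system would use the trained networks.
def toy_encoder(x: Sequence[float]) -> Vector:
    return [2.0 * v for v in x]


def toy_decoder(z: Sequence[float]) -> Vector:
    return [0.5 * v for v in z]


trajectory = interpolate_motion(
    [0.0, 1.0], [1.0, 3.0], toy_encoder, toy_decoder, steps=3
)
```

With these toy maps the decoded trajectory passes through the midpoint of the two key poses; with nonlinear trained networks the in-between robot poses would generally differ from a straight line in joint space.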

Abstract

This paper introduces a novel deep-learning approach for human-to-robot motion retargeting, enabling robots to mimic human poses accurately. Contrary to prior deep-learning-based works, our method does not require paired human-to-robot data, which facilitates its transfer to new robots. First, we construct a shared latent space between humans and robots via adaptive contrastive learning that takes advantage of a proposed cross-domain similarity metric between the human and robot poses. Additionally, we propose a consistency term to build a common latent space that captures the similarity of the poses with precision while allowing direct robot motion control from the latent space. For instance, we can generate in-between motion through simple linear interpolation between two projected human poses. We conduct a comprehensive evaluation of robot control from diverse modalities (i.e., texts, RGB videos, and key poses), which facilitates robot control for non-expert users. Our model outperforms existing works regarding human-to-robot retargeting in terms of efficiency and precision. Finally, we implemented our method on a real robot with self-collision avoidance through a whole-body controller to showcase the effectiveness of our approach. More information is available on our website: https://evm7.github.io/UnsH2R/
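
To make the idea of a rotation-based cross-domain similarity concrete, here is a minimal sketch. The paper's exact metric is not reproduced in this summary, so the code assumes a standard quaternion distance, 1 - <q_a, q_b>^2, summed over corresponding limbs; the names `rotation_distance`, `pose_distance`, and `pick_positive_negative` are hypothetical, and the limb correspondence between human and robot is assumed to be given.

```python
import math
from typing import Sequence, Tuple

Quaternion = Tuple[float, float, float, float]  # (w, x, y, z), unit norm
Pose = Sequence[Quaternion]                     # one global rotation per limb


def rotation_distance(q_a: Quaternion, q_b: Quaternion) -> float:
    """Assumed metric: 0 for identical rotations, 1 for maximally different.

    Squaring the dot product makes q and -q (the same rotation) equivalent.
    """
    dot = sum(a * b for a, b in zip(q_a, q_b))
    return 1.0 - dot * dot


def pose_distance(pose_a: Pose, pose_b: Pose) -> float:
    """Sum the per-limb rotation distances of two poses with matching limbs."""
    if len(pose_a) != len(pose_b):
        raise ValueError("poses must describe the same number of limbs")
    return sum(rotation_distance(a, b) for a, b in zip(pose_a, pose_b))


def pick_positive_negative(anchor: Pose, cand_1: Pose, cand_2: Pose):
    """Label the candidate closer to the anchor positive, the other negative."""
    if pose_distance(anchor, cand_1) <= pose_distance(anchor, cand_2):
        return cand_1, cand_2
    return cand_2, cand_1


identity: Quaternion = (1.0, 0.0, 0.0, 0.0)
quarter_turn: Quaternion = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
half_turn: Quaternion = (0.0, 0.0, 0.0, 1.0)

anchor = [identity, identity]
near = [identity, quarter_turn]   # one limb rotated by 90 degrees
far = [half_turn, half_turn]      # both limbs rotated by 180 degrees

positive, negative = pick_positive_negative(anchor, far, near)
```

Because the roles are decided by comparing distances rather than by fixed labels, no paired human-robot annotations are needed in this sketch, which is the property the abstract emphasizes.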

Paper Structure

This paper contains 21 sections, 5 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Our human-to-robot motion retargeting connects robot control with diverse source modalities, such as a text description, an RGB video, or key poses. Our approach can encode human skeletons into a shared latent space between humans and robots, and subsequently decode these latent variables into the robot's joint space, enabling direct robot control. Additionally, our approach facilitates the generation of smooth robot motions between human key poses (represented as green and blue dots) through interpolation within the latent space (indicated by the orange dots).
  • Figure 2: Model overview. Two human poses $(\mathbf{x}_{h}^{i}, \mathbf{x}_{h}^{j})$ are encoded into latent variables $(z^{i}, z^{j})$ within the shared space by the encoder $Q_{h}$. Similarly, a robot pose $\mathbf{x}_{r}^{k}$ is mapped into $z^{k}$ by $Q_{r}$. Given three samples $(z^{i}, z^{j}, z^{k})$, $z^{i}$ is randomly chosen as the anchor $z_{o}^{i}$, and $z^{j}, z^{k}$ are assigned as the negative $z_{-}^{j}$ and positive $z_{+}^{k}$ samples through the similarity metric (the paper's global rotation distance equation). The triplet loss $\mathcal{L}_{triplet}$ constrains the construction of the latent space by bringing $z_{o}^{i}$ and $z_{+}^{k}$ closer and pushing $z_{o}^{i}$ and $z_{-}^{j}$ apart. The decoder $D_{r}$ decodes the latent variable $z^{k}$ into $\mathbf{\hat{x}}_{r}^{k}$, which should be consistent with the robot pose $\mathbf{x}_{r}^{k}$ according to $\mathcal{L}_{rec}$. The latent variable $z^{j}$ from the human pose $\mathbf{x}_{h}^{j}$ is decoded into a robot pose $\mathbf{\hat{x}}_{r}^{j}$. To ensure that $\mathbf{\hat{x}}_{r}^{j}$ is from the same distribution as $\mathbf{x}_{r}^{k}$, $Q_{r}$ encodes $\mathbf{\hat{x}}_{r}^{j}$ back to a latent variable $\hat{z}^{j}$, and $\mathcal{L}_{ltc}$ minimizes the distance between $\hat{z}^{j}$ and $z^{j}$. During the inference phase, $\mathbf{\hat{x}}_{r}^{j}$ is used to control the robot directly to mimic the human pose $\mathbf{x}_{h}^{j}$.
  • Figure 3: Human retargeting comparison for different key poses. Various human skeleton key poses are retargeted to the Tiago++ robot. Our model preserves the visual similarity to the input pose and closely matches the manually annotated ground-truth poses.
  • Figure 4: Video-to-Motion. We leverage the state-of-the-art off-the-shelf 3D human pose estimator HybrIK [li2021hybrik] to translate RGB images into human skeletons. Then we employ our proposed method to achieve direct motion control from human skeletons.
  • Figure 5: Text-to-Motion. Our model can be connected in a pipeline with pre-trained motion synthesis models. In this case, we first use Text-to-Motion Retrieval [petrovich23tmr] to obtain human motion in skeleton representation. Then, we utilize our proposed method to translate the motion into robot control commands (i.e., joint angles) to mimic it.
  • ...and 1 more figure
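
The three training terms named in the Figure 2 caption ($\mathcal{L}_{triplet}$, $\mathcal{L}_{rec}$, $\mathcal{L}_{ltc}$) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the Euclidean norm, the `margin` value, and the toy `encode_robot` and `decode_robot` maps are all hypothetical choices, since the summary does not specify them.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Mapping = Callable[[Sequence[float]], Vector]


def l2(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two vectors (assumed norm)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin: float = 0.5) -> float:
    """Pull the anchor toward the positive and push it away from the negative."""
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)


def reconstruction_loss(x_r, encode_robot: Mapping, decode_robot: Mapping) -> float:
    """L_rec: a robot pose should survive the round trip x_r -> z -> x_r_hat."""
    return l2(decode_robot(encode_robot(x_r)), x_r)


def latent_consistency_loss(z_h, encode_robot: Mapping, decode_robot: Mapping) -> float:
    """L_ltc: a human latent code should survive the round trip z -> x_r_hat -> z_hat."""
    return l2(encode_robot(decode_robot(z_h)), z_h)


# Toy stand-ins (assumptions): deliberately imperfect so both losses are non-zero.
def encode_robot(x: Sequence[float]) -> Vector:
    return [2.0 * v for v in x]


def decode_robot(z: Sequence[float]) -> Vector:
    return [0.25 * v for v in z]


violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])   # positive farther than negative
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])  # margin already respected
rec = reconstruction_loss([4.0, 0.0], encode_robot, decode_robot)
ltc = latent_consistency_loss([3.0, 4.0], encode_robot, decode_robot)
```

In training, these terms would typically be combined into a weighted sum and minimized jointly; the weights are another hyperparameter not given in this summary.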