Bi-Level Motion Imitation for Humanoid Robots

Wenshuai Zhao, Yi Zhao, Joni Pajarinen, Michael Muehlebach

TL;DR

The paper addresses the problem that human MoCap trajectories can be physically infeasible for humanoid robots, which can degrade imitation policies. It introduces Bi-Level Motion Imitation (BMI), a framework that learns a latent dynamics model from MoCap data with a self-consistent auto-encoder (SCAE), uses the latent parameters to pre-train a robot policy, and then performs bi-level fine-tuning to align decoder outputs with physically feasible robot trajectories while preserving motion patterns. The key contributions are the SCAE for sparse, structured latent representations, the bi-level imitation scheme with latent-space regularization, and empirical validation on an MIT Humanoid model in simulation showing improved policy performance and motion stability across 13 motions. The approach targets scalable, data-driven humanoid imitation that respects physical constraints without explicit dynamics modeling, with the potential to leverage large MoCap datasets for real-world applications.

Abstract

Imitation learning from human motion capture (MoCap) data provides a promising way to train humanoid robots. However, due to differences in morphology, such as varying degrees of joint freedom and force limits, exact replication of human behaviors may not be feasible for humanoid robots. Consequently, incorporating physically infeasible MoCap data in training datasets can adversely affect the performance of the robot policy. To address this issue, we propose a bi-level optimization-based imitation learning framework that alternates between optimizing the robot policy and the target MoCap data. Specifically, we first develop a generative latent dynamics model using a novel self-consistent auto-encoder, which learns sparse and structured motion representations while capturing desired motion patterns in the dataset. The dynamics model is then used to generate reference motions, while the latent representation regularizes the bi-level motion imitation process. Simulations conducted with a realistic model of a humanoid robot demonstrate that our method enhances the robot policy by modifying reference motions to be physically consistent.
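
The alternation described in the abstract can be illustrated with a deliberately small toy problem. This is a sketch under strong assumptions, not the paper's method: the "robot" is reduced to a box limit on a 1-D trajectory, policy learning is replaced by the exact best feasible tracking (clipping), and the paper's latent-space regularizer is swapped for a plain penalty that keeps the refined reference close to the original motion. The names `track`, `refine_reference`, `alternate`, `limit`, and `weight` are all hypothetical.

```python
import numpy as np


def track(reference, limit):
    """Lower level (stand-in for policy learning, an assumption).

    Returns the feasible trajectory closest to the reference in squared
    error. Under simple box limits that minimiser is element-wise clipping.
    """
    return np.clip(reference, -limit, limit)


def refine_reference(robot_traj, original, weight):
    """Upper level: move the reference towards what the robot achieved.

    Closed-form minimiser over r of
        ||r - robot_traj||^2 + weight * ||r - original||^2,
    where the second term is a stand-in for the paper's latent-space
    regularizer (here it acts directly on the trajectory).
    """
    return (robot_traj + weight * original) / (1.0 + weight)


def alternate(original, limit=1.0, weight=0.5, iters=50):
    """Alternate between the two levels for a fixed number of iterations."""
    reference = original.copy()
    for _ in range(iters):
        robot_traj = track(reference, limit)
        reference = refine_reference(robot_traj, original, weight)
    return reference, track(reference, limit)
```

In this toy setting, samples that already lie within the limits are left unchanged, while infeasible samples are pulled towards the limit by an amount controlled by `weight`, so the refined reference is easier to track yet stays close to the original motion.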

Paper Structure

This paper contains 38 sections, 14 equations, 13 figures, 12 tables, and 1 algorithm.

Figures (13)

  • Figure 1: Structure of the proposed self-consistent auto-encoder (SCAE). The encoder enc first encodes the original trajectory $\tau_t$ into the latent embedding $z_t$. A Fourier transform is then applied to $z_t$ to obtain the latent parameters $\theta_t=(f_t, a_t, b_t)$, while a separate MLP module learns the phase $\phi_t$. A sinusoidal function reconstructs the latent embedding $\hat{z}_t$, and the decoder dec then recovers the input trajectory $\hat{\tau}_t$. In particular, $\hat{\tau}_t$ is fed back into the encoder to obtain the reconstructed latent embedding $\hat{\bar{z}}_t$. SCAE therefore combines motion and latent reconstruction losses, as indicated by the red arrows. We follow FLD in making multi-step predictions, so the final loss sums $L_0, \cdots, L_N$.
  • Figure 2: Bi-level motion imitation (BMI) fine-tunes the robot policy and the decoder alternately. Learning begins by sampling from the learned latent space $p(z)$ and decoding these latent samples into target reference motions for the robot to imitate. The decoder's loss function comprises two components, as indicated by the red arrows: (1) the mean squared error (MSE) between the robot's trajectory and the decoded trajectory, and (2) the latent reconstruction error between the sampled latent embeddings $\hat{z}_t$ and the embeddings of the decoded trajectories $\hat{\bar{z}}_t$.
  • Figure 3: Reconstruction error during training: (a) The reconstruction error of latent embeddings. (b) The reconstruction error of the original motion states.
  • Figure 4: Learned latent phases of four motions. Each circle represents a latent channel, where the radius is the amplitude and the black bar is the phase timing. Compared to FLD, SCAE uses fewer frequency components and lower amplitudes to represent the same motion.
  • Figure 5: Latent manifolds for 13 motions. Each color corresponds to a trajectory segment from one motion type. The arrows denote the direction in which the motion evolves. The manifold induced by SCAE shows consistent structures across different motions.
  • ...and 8 more figures
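
The Fourier parameterization in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the frequency is taken as the power-weighted mean of the non-DC bins, the amplitude and offset are read off the spectrum, and the phase is taken from the dominant FFT bin, whereas the caption states that the phase is learned by a separate MLP. The function names `latent_params` and `reconstruct`, the window length, and the sampling interval `dt` are hypothetical.

```python
import numpy as np


def latent_params(z, dt):
    """Estimate (f, a, b, phi) per latent channel from a window z.

    z has shape (channels, n); dt is the sampling interval in seconds.
    """
    n = z.shape[-1]
    spec = np.fft.rfft(z, axis=-1)
    freqs = np.fft.rfftfreq(n, d=dt)

    # Power of all non-DC bins.
    power = np.abs(spec[:, 1:]) ** 2
    total = power.sum(axis=-1)

    # Frequency: power-weighted mean of the bin frequencies (assumption).
    f = (power * freqs[1:]).sum(axis=-1) / (total + 1e-12)
    # Amplitude of a single sinusoid carrying the same spectral power.
    a = 2.0 * np.sqrt(total) / n
    # Offset: the DC component.
    b = spec[:, 0].real / n

    # Phase from the dominant bin; a stand-in for the learned MLP phase.
    k = 1 + power.argmax(axis=-1)
    peak = spec[np.arange(z.shape[0]), k]
    phi = (np.angle(peak) + np.pi / 2.0) / (2.0 * np.pi)
    return f, a, b, phi


def reconstruct(f, a, b, phi, n, dt):
    """Rebuild the latent window as a * sin(2*pi*(f*t + phi)) + b."""
    t = np.arange(n) * dt
    arg = 2.0 * np.pi * (f[:, None] * t + phi[:, None])
    return a[:, None] * np.sin(arg) + b[:, None]
```

For a latent channel that is a single sinusoid whose frequency falls exactly on an FFT bin, `reconstruct(*latent_params(z, dt), n, dt)` recovers the window up to floating-point error; for other signals it returns only a single-sinusoid approximation per channel.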