Benchmarking Humanoid Imitation Learning with Motion Difficulty

Zhaorui Meng, Lu Yin, Xinrui Chen, Anjun Chen, Shihui Guo, Yipeng Qin

TL;DR

The paper addresses a limitation of existing evaluation metrics in physics-based humanoid motion imitation: they conflate policy performance with the intrinsic difficulty of motions. It introduces the Motion Difficulty Score (MDS), a physics-grounded metric defined as the torque variation induced by bounded pose perturbations and decomposed into Spectral Diversity, Variance Diversity, and Segment Diversity, and complements it with MD-AMASS, a difficulty-aware repartitioning of the AMASS dataset. The authors validate MDS as a strong predictor of imitation error and derive two metrics from it, Maximum Imitable Difficulty (MID) and Difficulty-Stratified Joint Error (DSJE), for finer, difficulty-aware evaluation. The work also reports curriculum-learning benefits and broader applicability of MDS to motion-quality assessment, anomaly detection, and cross-robot generalization.

Abstract

Physics-based motion imitation is central to humanoid control, yet current evaluation metrics (e.g., joint position error) measure only how well a policy imitates, not how difficult the motion itself is. This conflates policy performance with motion difficulty, obscuring whether failures stem from poor learning or inherently challenging motions. In this work, we address this gap with Motion Difficulty Score (MDS), a novel metric that defines and quantifies imitation difficulty independent of policy performance. Grounded in rigid-body dynamics, MDS interprets difficulty as the torque variation induced by small pose perturbations: larger torque-to-pose variation yields flatter reward landscapes and thus higher learning difficulty. MDS captures this through three properties of the perturbation-induced torque space: volume, variance, and temporal variability. We also use it to construct MD-AMASS, a difficulty-aware repartitioning of the AMASS dataset. Empirically, we rigorously validate MDS by demonstrating its explanatory power on the performance of state-of-the-art motion imitation policies. We further demonstrate the utility of MDS through two new MDS-based metrics: Maximum Imitable Difficulty (MID) and Difficulty-Stratified Joint Error (DSJE), providing fresh insights into imitation learning.

Paper Structure

This paper contains 30 sections, 2 theorems, 27 equations, 8 figures, 4 tables.

Key Result

Proposition A.1

For a given motion sequence $S$, the volume of the torque variation $\mathcal{T}$ induced by $\mathcal{N}(S)$ is governed by the Gram matrices $G_i = dF_i(s_i) \, dF_i(s_i)^T$: the volume $\text{vol}_Y(\mathcal{T})$ is determined by $\prod_{i=1}^t \sqrt{\det(G_i)}$.
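As a rough numerical illustration of the product $\prod_{i=1}^t \sqrt{\det(G_i)}$ in Proposition A.1, the sketch below evaluates it in the log domain for a list of per-frame Jacobians. The function name `log_volume_factor`, the assumption that each $dF_i(s_i)$ is available as a dense NumPy array, and the choice to work with logarithms are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def log_volume_factor(jacobians):
    """Return log( prod_i sqrt(det(G_i)) ) with G_i = J_i @ J_i.T.

    `jacobians` is a hypothetical list of per-frame arrays J_i standing in
    for dF_i(s_i); each has shape (m, n) with m <= n.
    """
    total = 0.0
    for J in jacobians:
        G = J @ J.T  # Gram matrix G_i, shape (m, m)
        sign, logdet = np.linalg.slogdet(G)
        if sign <= 0:
            # A rank-deficient frame collapses the volume to zero.
            return -np.inf
        total += 0.5 * logdet  # log sqrt(det G_i)
    return total


# Toy usage with two frames; the per-frame factors are 6 and 2.
J1 = np.diag([2.0, 3.0])
J2 = np.array([[1.0, 0.0, 0.0],
               [0.0, 2.0, 0.0]])
print(np.exp(log_volume_factor([J1, J2])))  # 6 * 2 = 12, up to rounding
```

Summing log-determinants rather than multiplying determinants avoids overflow or underflow when the sequence length $t$ is large, which is why the sketch returns a logarithm.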

Figures (8)

  • Figure 1: Our Motion Difficulty Score (MDS) accurately quantifies motion difficulty: higher MDS $\to$ higher policy error $\to$ harder to imitate. Beyond validation, MDS reveals nuanced insights into imitation learning: e.g., PHC+ [luo2023universal] dominates overall, but UHC [luo2021dynamics] outperforms it on easy motions, challenging a common belief in this field. Leveraging this benchmark, we introduce Maximum Imitable Difficulty (MID) and Difficulty-Stratified Joint Error (DSJE), enabling fine-grained, difficulty-aware evaluation of imitation learning.
  • Figure 2: Illustration of MDS. For an easy motion (top-left), small pose perturbations induce small torque variance and hence low sensitivity to perturbation: the 1) smaller torque-space volume, 2) larger variation across joints (larger volume on the waving-hand joints than on the others), and 3) larger temporal variability lead MDS to rate the motion as easier. In contrast, for a difficult motion (bottom-left), the same level of perturbation yields large torque variance across all joints, leading to a high MDS.
  • Figure 3: Our Difficulty-aware AMASS Dataset MD-AMASS.
  • Figure 4: Scatter plots of MDS versus imitation error for different policies on over 3000 motion clips (left: UHC; right: PHC+), with samples drawn from the policies' training set. As MDS increases from left to right (indicating higher difficulty), error rises from bottom to top, demonstrating that MDS effectively captures motion difficulty.
  • Figure 5: Imitation fidelity visibly degrades as MDS increases, validating MDS as an accurate model of motion difficulty.
  • ...and 3 more figures
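The difficulty-aware evaluation described in the captions above can be approximated by grouping clips into difficulty strata and reporting the mean joint error per stratum. This listing does not give the paper's exact DSJE definition, so the following is only a minimal sketch under assumptions: the function name `stratified_joint_error`, the equal-population quantile strata, and the per-clip `mds` and `joint_error` inputs are all hypothetical.

```python
import numpy as np


def stratified_joint_error(mds, joint_error, n_bins=3):
    """Mean joint error per difficulty stratum (assumed DSJE-style summary).

    `mds` and `joint_error` are hypothetical per-clip arrays of equal length.
    Strata are quantile bins of `mds`, ordered from easiest to hardest.
    """
    mds = np.asarray(mds, dtype=float)
    joint_error = np.asarray(joint_error, dtype=float)

    # Quantile edges so each stratum holds roughly the same number of clips.
    edges = np.quantile(mds, np.linspace(0.0, 1.0, n_bins + 1))
    # Interior edges map each clip to a stratum index in 0..n_bins-1.
    strata = np.searchsorted(edges[1:-1], mds, side="right")

    means = []
    for b in range(n_bins):
        mask = strata == b
        # An empty stratum (possible with many tied scores) yields NaN.
        means.append(float(joint_error[mask].mean()) if mask.any() else float("nan"))
    return means


# Toy usage: six clips with error growing alongside difficulty.
print(stratified_joint_error([1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60]))
```

Reporting one number per stratum instead of a single dataset-wide average is what allows a comparison such as "one policy is better overall, another is better on easy motions" to show up at all.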

Theorems & Definitions (5)

  • Definition 4.1: Motion Difficulty
  • Proposition A.1
  • proof
  • Proposition A.2
  • proof