Mimicking-Bench: A Benchmark for Generalizable Humanoid-Scene Interaction Learning via Human Mimicking

Yun Liu, Bowen Yang, Licheng Zhong, He Wang, Li Yi

TL;DR

Mimicking-Bench introduces a comprehensive benchmark for learning generalizable humanoid-scene interaction by mimicking large-scale human references. It pairs a six-task, geometry-rich environment with a modular three-stage skill-learning paradigm (retargeting, tracking, imitation) and a large-scale human reference dataset, enabling both pipeline-level and per-module evaluation. The experiments show that human mimicking improves task success and generalization, highlight critical design choices in the retargeting, tracking, and imitation components, and identify directions for future research such as dexterous hand integration. The work provides a foundation for systematic, scalable exploration of humanoid-scene interaction learning in both simulated and real-world contexts.

Abstract

Learning generic skills for humanoid robots interacting with 3D scenes by mimicking human data is a key research challenge with significant implications for robotics and real-world applications. However, existing methodologies and benchmarks rely on small-scale, manually collected demonstrations, and they lack the general dataset and benchmark support needed to explore scene geometry generalization effectively. To address this gap, we introduce Mimicking-Bench, the first comprehensive benchmark designed for generalizable humanoid-scene interaction learning through mimicking large-scale human animation references. Mimicking-Bench includes six household full-body humanoid-scene interaction tasks, covering 11K diverse object shapes, along with 20K synthetic and 3K real-world human interaction skill references. We construct a complete humanoid skill learning pipeline and benchmark approaches for motion retargeting, motion tracking, imitation learning, and their various combinations. Extensive experiments highlight the value of human mimicking for skill learning and reveal key challenges and research directions.

Paper Structure

This paper contains 48 sections, 2 equations, 7 figures, 13 tables.

Figures (7)

  • Figure 1: Mimicking-Bench simulation configurations. (a) exemplifies an interaction scenario of H1 in Isaac Gym. (b) and (c) show the captured elevation map and color images from four egocentric cameras.
  • Figure 2: Humanoid interaction skill learning paradigm.
  • Figure 3: Qualitative comparison of data-driven human mimicking and data-free RL on sitting on sofas. RL struggles to produce reasonable poses despite completing the task kinematically.
  • Figure 4: LB task success rates on varying object sizes. Object length/width refers to the size of bounding boxes and object height refers to the height of lying planes of beds.
  • Figure 5: Task success rates rise as the training data scale grows.
  • ...and 2 more figures