MeshMimic: Geometry-Aware Humanoid Motion Learning through 3D Scene Reconstruction

Qiang Zhang, Jiahao Ma, Peiran Liu, Shuai Shi, Zeran Su, Zifan Wang, Jingkai Sun, Wei Cui, Jialin Yu, Gang Han, Wen Zhao, Pihai Sun, Kangning Yin, Jiaxu Wang, Jiahang Cao, Lingfeng Zhang, Hao Cheng, Xiaoshuai Hao, Yiding Ji, Junwei Liang, Jian Tang, Renjing Xu, Yijie Guo

TL;DR

MeshMimic addresses a practical bottleneck in humanoid learning: obtaining physically consistent motion data without motion capture or explicit sensing of the environment. From monocular video, it reconstructs both human trajectories and detailed 3D terrain meshes, then applies kinematic-consistency optimization and a contact-aware MeshRetarget step to map the motion to a humanoid for RL training in a Real-to-Sim-to-Real loop. The approach yields improved motion and scene reconstruction over baselines, enabling terrain-aware, dynamic behaviors with a low-cost data pipeline. It points toward scalable humanoid learning in unstructured environments using only consumer-grade vision.

Abstract

Humanoid motion control has witnessed significant breakthroughs in recent years, with deep reinforcement learning (RL) emerging as a primary catalyst for achieving complex, human-like behaviors. However, the high dimensionality and intricate dynamics of humanoid robots make manual motion design impractical, leading to a heavy reliance on expensive motion capture (MoCap) data. These datasets are not only costly to acquire but also frequently lack the necessary geometric context of the surrounding physical environment. Consequently, existing motion synthesis frameworks often suffer from a decoupling of motion and scene, resulting in physical inconsistencies such as contact slippage or mesh penetration during terrain-aware tasks. In this work, we present MeshMimic, an innovative framework that bridges 3D scene reconstruction and embodied intelligence to enable humanoid robots to learn coupled "motion-terrain" interactions directly from video. By leveraging state-of-the-art 3D vision models, our framework precisely segments and reconstructs both human trajectories and the underlying 3D geometry of terrains and objects. We introduce an optimization algorithm based on kinematic consistency to extract high-quality motion data from noisy visual reconstructions, alongside a contact-invariant retargeting method that transfers human-environment interaction features to the humanoid agent. Experimental results demonstrate that MeshMimic achieves robust, highly dynamic performance across diverse and challenging terrains. Our approach proves that a low-cost pipeline utilizing only consumer-grade monocular sensors can facilitate the training of complex physical interactions, offering a scalable path toward the autonomous evolution of humanoid robots in unstructured environments.
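To make the idea of extracting clean motion from noisy reconstructions concrete, here is a minimal sketch. It is not the paper's optimization: the abstract does not specify the objective, so this example assumes a generic stand-in that keeps a 1D trajectory close to its noisy observations while penalizing squared accelerations. The names `smooth_trajectory` and `roughness`, and the `weight`, `steps`, and `lr` values, are all hypothetical.

```python
def roughness(traj):
    """Sum of squared second differences (a discrete acceleration measure)."""
    return sum(
        (traj[t - 1] - 2.0 * traj[t] + traj[t + 1]) ** 2
        for t in range(1, len(traj) - 1)
    )


def smooth_trajectory(noisy, weight=5.0, steps=2000, lr=0.005):
    """Gradient descent on an assumed objective (not taken from the paper):

        sum_t (x_t - y_t)^2 + weight * sum_t (x_{t-1} - 2 x_t + x_{t+1})^2

    where y is the noisy observation and x is the refined trajectory.
    """
    x = list(noisy)
    n = len(x)
    for _ in range(steps):
        # Data term: pull each sample toward its noisy observation.
        grad = [2.0 * (x[t] - noisy[t]) for t in range(n)]
        # Smoothness term: each acceleration touches three neighbouring samples.
        for t in range(1, n - 1):
            acc = x[t - 1] - 2.0 * x[t] + x[t + 1]
            grad[t - 1] += 2.0 * weight * acc
            grad[t] -= 4.0 * weight * acc
            grad[t + 1] += 2.0 * weight * acc
        x = [x[t] - lr * grad[t] for t in range(n)]
    return x


if __name__ == "__main__":
    # Hypothetical jittery height samples of one joint over eight frames.
    noisy = [0.0, 0.12, 0.18, 0.33, 0.38, 0.52, 0.59, 0.71]
    smoothed = smooth_trajectory(noisy)
    print("roughness before:", roughness(noisy))
    print("roughness after: ", roughness(smoothed))
```

The step size is kept small on purpose: with `weight=5.0` the objective's curvature is bounded by 162, so `lr=0.005` keeps plain gradient descent stable. A real pipeline would operate on full-body poses and would likely add further kinematic terms.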

Paper Structure

This paper contains 21 sections, 6 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: MeshMimic Real-Sim-Real Pipeline. Starting from a monocular video, we reconstruct the scene geometry and human motion, jointly align them to recover metrically consistent human-scene interactions, and retarget the refined motion to a humanoid in simulation for RL policy learning. Finally, we deploy the learned policy to the real robot, enabling stable execution over challenging terrain.
  • Figure 2: Left: Depth-edge-guided contact prediction. Right: MeshRetargeting optimization for penetration correction and contact-consistent retargeting.
  • Figure 3: Comparison with VideoMimic.
  • Figure 4: Effect of motion and terrain reconstruction on training and deployment performance (MMM: MeshMimic Motion; VMM: VideoMimic Motion; MMT: MeshMimic Terrain; VMT: VideoMimic Terrain).
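The penetration correction and contact-consistent retargeting named in the Figure 2 caption can be illustrated with a toy sketch. It assumes a much simpler setting than the paper: the terrain is a hypothetical height function rather than a reconstructed mesh, and the correction is a per-point projection rather than an optimization. The names `stair_height` and `correct_foot` are illustrative only.

```python
def stair_height(x, y):
    """Toy terrain (assumption): 0.2 m steps every 0.3 m along x.

    Stands in for a lookup into a reconstructed terrain mesh.
    """
    return 0.2 * max(0, int(x // 0.3))


def correct_foot(point, in_contact, height_fn):
    """Return a foot keypoint that is consistent with the terrain.

    - If the point is labelled as in contact, snap it onto the surface.
    - Otherwise only lift it when it lies below the surface (penetration).
    """
    x, y, z = point
    ground = height_fn(x, y)
    if in_contact:
        return (x, y, ground)
    return (x, y, max(z, ground))


if __name__ == "__main__":
    # Hypothetical foot keypoints: (position, contact label).
    frames = [
        ((0.10, 0.0, 0.05), True),   # stance foot hovering above flat ground
        ((0.35, 0.0, 0.10), False),  # swing foot sunk into the first step
        ((0.10, 0.0, 0.30), False),  # swing foot well above ground
    ]
    for point, contact in frames:
        print(point, "->", correct_foot(point, contact, stair_height))
```

A projection like this fixes each point in isolation, whereas an optimization-based approach could trade off penetration, contact, and motion fidelity jointly across the whole body and sequence.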