MetricHMSR: Metric Human Mesh and Scene Recovery from Monocular Images
Chentao Song, He Zhang, Haolei Yuan, Haozhe Lin, Jianhua Tao, Hongwen Zhang, Tao Yu
TL;DR
MetricHMSR tackles metric human mesh and scene recovery from a single monocular image. It introduces pixel-aligned camera rays that encode both camera intrinsics and bounding-box information, and employs a Human Mixture-of-Experts (MoE) architecture that jointly learns local pose, metric shape, and metric position. A depth-refinement module guided by the metric human mesh further aligns the reconstructed scene to metric scale, while SynFocal provides focal-length variation to validate robustness. The approach demonstrates state-of-the-art performance on both human mesh recovery and metric depth estimation across diverse datasets, and enables automatic generation of metric pseudo-ground-truth for in-the-wild images. Together, these contributions advance unified 3D understanding of humans and scenes from monocular imagery, with practical implications for AR/VR, film, and robotics.
Abstract
We introduce MetricHMSR (Metric Human Mesh and Scene Recovery), a novel approach for metric human mesh and scene recovery from monocular images. Due to unrealistic assumptions in the camera model and the inherent challenges of metric perception, existing approaches struggle to estimate both human pose and metric 3D position within a unified module. To address this limitation, MetricHMSR incorporates camera rays that comprehensively encode both the bounding box information and the intrinsic parameters of perspective projection. We then propose a Human Mixture-of-Experts (MoE) model that dynamically routes image features and ray features to task-specific experts, each specializing in a different aspect of the data. This yields a unified framework that simultaneously perceives local pose and global 3D position. Building on these results, we further refine an existing monocular metric depth estimation method to achieve more accurate depth, ultimately enabling the seamless overlay of humans and scenes in 3D space. Comprehensive experimental results demonstrate that the proposed method achieves state-of-the-art performance on both human mesh and scene recovery.
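To illustrate how camera rays can carry both the bounding box and the intrinsics, the following is a minimal sketch under the standard pinhole camera model, not the paper's implementation. The function name `crop_camera_rays`, the grid size `out_size`, and the choice of unit-normalized directions are assumptions made for illustration only. Each sample inside the box is back-projected as d = K^{-1} [u, v, 1]^T, so the box location shifts the rays and the focal length scales their angular spread.

```python
import numpy as np


def crop_camera_rays(fx, fy, cx, cy, bbox, out_size=4):
    """Unit ray directions for an out_size x out_size grid sampled over a box.

    Hypothetical helper: bbox is (x_min, y_min, x_max, y_max) in full-image
    pixel coordinates; fx, fy, cx, cy are pinhole intrinsics in pixels.
    """
    x_min, y_min, x_max, y_max = bbox
    # Centers of the crop grid cells, expressed in full-image coordinates.
    us = x_min + (np.arange(out_size) + 0.5) * (x_max - x_min) / out_size
    vs = y_min + (np.arange(out_size) + 0.5) * (y_max - y_min) / out_size
    u, v = np.meshgrid(us, vs)  # each has shape (out_size, out_size)
    # Back-project through the pinhole model: d = K^{-1} [u, v, 1]^T.
    dirs = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones_like(u)], axis=-1)
    # Normalize so each ray is a unit direction in camera coordinates.
    return dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)


# Same box, two focal lengths: the shorter focal length gives a wider ray fan.
box = (220.0, 140.0, 420.0, 340.0)
rays_long = crop_camera_rays(1000.0, 1000.0, 320.0, 240.0, box)
rays_short = crop_camera_rays(500.0, 500.0, 320.0, 240.0, box)
```

Because the rays are computed in full-image coordinates, two identical-looking crops taken at different image positions or with different focal lengths produce different ray maps, which is the kind of information a cropped image alone does not provide.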
