MetricHMSR: Metric Human Mesh and Scene Recovery from Monocular Images

Chentao Song, He Zhang, Haolei Yuan, Haozhe Lin, Jianhua Tao, Hongwen Zhang, Tao Yu

TL;DR

MetricHMSR tackles metric human mesh and scene recovery from a single monocular image by introducing pixel-aligned camera rays that encode camera intrinsics and bounding-box information, and by employing a Human Mixture-of-Experts (MoE) architecture that jointly learns local pose, metric shape, and metric position. A depth-refinement module guided by the metric human mesh further aligns the reconstructed scene to metric scale, while SynFocal provides focal-length variation to validate robustness. The approach demonstrates state-of-the-art performance on both human mesh recovery and metric depth estimation across diverse datasets, and enables automatic generation of metric pseudo-ground-truth for in-the-wild images. Together, these contributions advance unified 3D understanding of humans and scenes from monocular imagery, with practical implications for AR/VR, film, and robotics.
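To make the idea of pixel-aligned camera rays concrete, here is a minimal sketch that assumes a standard pinhole camera. The function names `pixel_ray` and `bbox_rays`, the grid resolution, and the choice to sample rays at cell centers are hypothetical illustrations and are not taken from the paper.

```python
import math

def pixel_ray(u, v, fx, fy, cx, cy):
    """Unit ray direction through pixel (u, v) for a pinhole camera.

    fx, fy are focal lengths in pixels; (cx, cy) is the principal point.
    """
    x = (u - cx) / fx
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)

def bbox_rays(bbox, fx, fy, cx, cy, grid=4):
    """Sample a grid x grid map of rays at cell centers inside a bounding box.

    bbox is (x0, y0, x1, y1) in full-image pixel coordinates, so the
    resulting rays depend on both the intrinsics and the box location.
    """
    x0, y0, x1, y1 = bbox
    rays = []
    for i in range(grid):
        v = y0 + (i + 0.5) * (y1 - y0) / grid
        row = []
        for j in range(grid):
            u = x0 + (j + 0.5) * (x1 - x0) / grid
            row.append(pixel_ray(u, v, fx, fy, cx, cy))
        rays.append(row)
    return rays
```

Under these assumptions, moving the bounding box or changing the focal length changes the ray map, which is why a single ray map can carry both pieces of information.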

Abstract

We introduce MetricHMSR (Metric Human Mesh and Scene Recovery), a novel approach for metric human mesh and scene recovery from monocular images. Due to unrealistic assumptions in the camera model and inherent challenges in metric perception, existing approaches struggle to estimate both human pose and metric 3D position within a unified module. To address this limitation, MetricHMSR incorporates camera rays to comprehensively encode both the bounding box information and the intrinsic parameters of perspective projection. We then propose a Human Mixture-of-Experts (MoE) model that dynamically routes image features and ray features to task-specific experts for specialized understanding of different data aspects, enabling a unified framework that simultaneously perceives the local pose and the global 3D position. Based on these results, we further refine an existing monocular metric depth estimation method to achieve more accurate results, ultimately enabling the seamless overlay of humans and scenes in 3D space. Comprehensive experimental results demonstrate that the proposed method achieves state-of-the-art performance on both human mesh and scene recovery.
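The abstract does not specify how the recovered human is used to refine scene depth. As one illustrative possibility only, the sketch below assumes the refinement is a single global scale correction: the predicted depth map is rescaled so that it agrees, in the median, with the depth of the metric human mesh on human pixels. The function name `align_depth_to_human`, the median-ratio rule, and the list-of-lists depth format are all assumptions, not the paper's method.

```python
from statistics import median

def align_depth_to_human(pred_depth, mesh_depth, human_mask):
    """Rescale pred_depth so it agrees with mesh_depth on human pixels.

    pred_depth, mesh_depth: 2D lists of depths in the same units.
    human_mask: 2D list of booleans marking pixels covered by the human.
    Returns (refined_depth, scale). Assumed rule: median per-pixel ratio.
    """
    ratios = [
        m / p
        for p_row, m_row, k_row in zip(pred_depth, mesh_depth, human_mask)
        for p, m, k in zip(p_row, m_row, k_row)
        if k and p > 0 and m > 0
    ]
    if not ratios:
        # No usable human pixels: leave the prediction unchanged.
        return [row[:] for row in pred_depth], 1.0
    scale = median(ratios)
    return [[p * scale for p in row] for row in pred_depth], scale
```

A median is used here because it tolerates a few outlier pixels, for example at the silhouette boundary; a per-pixel or affine correction would be an equally plausible assumption.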

Paper Structure

This paper contains 41 sections, 8 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 2: Overview. MetricHMSR recovers human pose, metric shape, global position, and the scene from a single image. The images and corresponding camera rays are encoded into tokens and fed into the Patch MoE, which automatically routes these tokens to appropriate experts for specialized knowledge learning. The outputs of the Patch MoE are then passed to the Global MoE, which further refines the learning at the image level. Finally, the processed features are passed through output heads to recover the metric human mesh. We use the metric human mesh as a reference to refine the depth estimated by MapAnything, achieving precise metric depth estimation. This ultimately enables an accurate 3D overlay of humans and scenes at metric scale. $\oplus$ denotes concatenation.
  • Figure 3: The architecture of MoE Layer. We designed a ray expert to learn features from camera rays, 4 routed image experts to process specialized image knowledge, and a shared image expert to capture common image knowledge.
  • Figure 4: Routing heatmap of the deepest (last) MoE layer for image features on 3DPW.
  • Figure 5: Expert allocation maps of the deepest (last) MoE layer. Different color blocks denote different expert assignments.
  • Figure 6: Examples of SynFocal. Each column of images was rendered using a distinct focal length (pixels).
  • ...and 9 more figures
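
The expert layout described in the Figure 3 caption (one ray expert, 4 routed image experts, one shared image expert) can be sketched as follows. This is a toy illustration using only the Python standard library: the class name `MoELayerSketch`, the use of random linear maps as experts, top-1 routing, softmax gating, and summing the shared and routed outputs are all assumptions, since the captions do not state these details.

```python
import math
import random

def make_linear(dim_in, dim_out, rng):
    """Return a random linear map as a closure (stand-in for an expert MLP)."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(dim_in)] for _ in range(dim_out)]
    return lambda x: [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

class MoELayerSketch:
    def __init__(self, dim, num_routed=4, top_k=1, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k  # assumed; the routing fan-out is not given in the captions
        self.router = make_linear(dim, num_routed, rng)
        self.routed = [make_linear(dim, dim, rng) for _ in range(num_routed)]
        self.shared = make_linear(dim, dim, rng)      # sees every image token
        self.ray_expert = make_linear(dim, dim, rng)  # sees only ray tokens

    def forward_image_token(self, x):
        """Shared expert output plus gate-weighted output of the chosen routed experts."""
        probs = softmax(self.router(x))
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = order[: self.top_k]
        out = self.shared(x)
        for i in chosen:
            y = self.routed[i](x)
            out = [o + probs[i] * yi for o, yi in zip(out, y)]
        return out, chosen

    def forward_ray_token(self, r):
        """Ray tokens bypass the router and go straight to the ray expert."""
        return self.ray_expert(r)
```

Recording `chosen` for each token is what would let one draw expert allocation maps like those described for Figure 5, although how the paper produces its maps is not stated here.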