From Camera to World: A Plug-and-Play Module for Human Mesh Transformation

Changhai Ma, Ziyu Wu, Yunkang Zhang, Qijun Ying, Boyan Liu, Xiaohui Cai

TL;DR

This work tackles the reconstruction of 3D human meshes in the world coordinate system from monocular images when the camera rotation is unknown. It introduces Mesh-Plug, a plug-and-play framework with two components: CamNet, which estimates camera rotation in a human-centered way from RGB and depth renders, and CorrectNet, which adjusts the initial camera-coordinate SMPL parameters so that the mesh is aligned to the world frame. A hybrid loss L_mix improves root-orientation accuracy while maintaining pose coherence. Experiments on SPEC-SYN and SPEC-MTP show improvements over state-of-the-art methods, supporting the decoupling of camera rotation estimation from mesh refinement, the occlusion-aware training, and the ablation-driven design choices.

Abstract

Reconstructing accurate 3D human meshes in the world coordinate system from in-the-wild images remains challenging due to the lack of camera rotation information. While existing methods achieve promising results in the camera coordinate system by assuming zero camera rotation, this simplification leads to significant errors when the reconstructed mesh is transformed to the world coordinate system. To address this challenge, we propose Mesh-Plug, a plug-and-play module that accurately transforms human meshes from camera coordinates to world coordinates. Our key innovation lies in a human-centered approach that leverages both RGB images and depth maps rendered from the initial mesh to estimate camera rotation parameters, eliminating the dependency on environmental cues. Specifically, we first train a camera rotation prediction module that focuses on the human body's spatial configuration to estimate the camera pitch angle. Then, by integrating the predicted camera parameters with the initial mesh, we design a mesh adjustment module that simultaneously refines the root joint orientation and body pose. Extensive experiments demonstrate that our framework outperforms state-of-the-art methods on the benchmark datasets SPEC-SYN and SPEC-MTP.
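
To make the zero-rotation issue concrete, the following is a minimal geometric sketch, not the Mesh-Plug implementation. It assumes that pitch is a right-handed rotation about the camera x-axis and that the camera rotation maps world coordinates to camera coordinates; both conventions, and all function names, are illustrative assumptions. It covers only the rigid part of the problem (re-expressing the root orientation given a known pitch), whereas the module described above uses a learned network that also refines body pose.

```python
import math

def pitch_rotation(pitch_rad):
    """3x3 rotation about the x-axis (assumed pitch convention)."""
    c, s = math.cos(pitch_rad), math.sin(pitch_rad)
    return [[1.0, 0.0, 0.0],
            [0.0, c, -s],
            [0.0, s, c]]

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    """Transpose of a 3x3 matrix; for a rotation this is its inverse."""
    return [list(row) for row in zip(*a)]

def camera_to_world_root(root_cam, pitch_rad):
    """Re-express a root orientation given in camera coordinates in the
    world frame, assuming x_cam = R_cam @ x_world (hypothetical helper)."""
    r_cam = pitch_rotation(pitch_rad)
    return matmul(transpose(r_cam), root_cam)

def tilt_deg(rot):
    """Rotation angle of a rotation matrix relative to identity, in degrees."""
    trace = rot[0][0] + rot[1][1] + rot[2][2]
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

# A person standing upright in the world (identity root orientation),
# observed by a camera pitched by 20 degrees.
pitch = math.radians(20.0)
root_cam = pitch_rotation(pitch)        # orientation as seen in camera coordinates
root_world = camera_to_world_root(root_cam, pitch)

# Treating root_cam as if it were the world orientation (zero camera rotation)
# leaves the person tilted by the full pitch angle.
tilt_if_rotation_ignored = tilt_deg(root_cam)
tilt_after_correction = tilt_deg(root_world)
```

Under these assumptions, ignoring the camera rotation leaves a residual tilt equal to the pitch angle (20 degrees here), while undoing the pitch brings the root orientation back to upright up to floating-point error.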

Paper Structure

This paper contains 17 sections, 10 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Comparison of CLIFF estimation results with our estimation results. Due to the pitch movement of the camera, the meshes reconstructed by CLIFF appear tilted. In contrast, our method reconstructs a human mesh that is more consistent with reality.
  • Figure 2: Overview of our pipeline. Given a monocular RGB image and the initial SMPL parameters in the camera coordinate system, we first use the SMPL model to render an RGB image and a depth map from the camera's perspective. These are then input into CamNet to estimate the camera's pitch angle (during training, we estimate the pitch and roll angles, but only the pitch angle is used as input during human mesh reconstruction). Subsequently, the RGB image, initial SMPL parameters, and camera pitch angle are fed into CorrectNet to obtain the human model in the world coordinate system.
  • Figure 3: (a) Illustration of the AMASS-Cam data collection process. (b) From top to bottom, the effects of different pitch, roll, and yaw angles on the position of the person in the image.
  • Figure 4: Qualitative comparison between REFIT, SPEC, and our method on SPEC-MTP.