From Camera to World: A Plug-and-Play Module for Human Mesh Transformation
Changhai Ma, Ziyu Wu, Yunkang Zhang, Qijun Ying, Boyan Liu, Xiaohui Cai
TL;DR
This work reconstructs 3D human meshes in the world coordinate system from monocular images when the camera rotation is unknown. It introduces Mesh-Plug, a plug-and-play framework with two components: CamNet, which estimates camera rotation in a human-centered way from the RGB image and depth maps rendered from the initial mesh, and CorrectNet, which adjusts the initial camera-coordinate SMPL parameters to align them with the world frame. A hybrid loss L_mix improves root-orientation accuracy while keeping the body pose coherent. Experiments on SPEC-SYN and SPEC-MTP show improvements over state-of-the-art methods, and ablations support both the decoupling of camera rotation estimation from mesh refinement and the occlusion-aware training.
Abstract
Reconstructing accurate 3D human meshes in the world coordinate system from in-the-wild images remains challenging due to the lack of camera rotation information. While existing methods achieve promising results in the camera coordinate system by assuming zero camera rotation, this simplification leads to significant errors when transforming the reconstructed mesh to the world coordinate system. To address this challenge, we propose Mesh-Plug, a plug-and-play module that accurately transforms human meshes from camera coordinates to world coordinates. Our key innovation lies in a human-centered approach that leverages both RGB images and depth maps rendered from the initial mesh to estimate camera rotation parameters, eliminating the dependency on environmental cues. Specifically, we first train a camera rotation prediction module that focuses on the human body's spatial configuration to estimate the camera pitch angle. Then, by integrating the predicted camera parameters with the initial mesh, we design a mesh adjustment module that simultaneously refines the root joint orientation and body pose. Extensive experiments demonstrate that our framework outperforms state-of-the-art methods on the benchmark datasets SPEC-SYN and SPEC-MTP.
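To make the geometry behind the camera-to-world transformation concrete, the following is a minimal sketch of the purely rigid part of the problem: given a camera pitch angle, rotate an SMPL-style root orientation from the camera frame to the world frame. It is not the Mesh-Plug method itself, which uses a learned adjustment module that also refines body pose. The function names, the choice of the camera x-axis as the pitch axis, and the world-to-camera convention for the pitch rotation are all assumptions made for illustration.

```python
import numpy as np

def axis_angle_to_matrix(rotvec):
    """Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    rotvec = np.asarray(rotvec, dtype=float)
    theta = np.linalg.norm(rotvec)
    if theta < 1e-8:
        return np.eye(3)
    k = rotvec / theta
    # Skew-symmetric cross-product matrix of the unit axis k.
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def pitch_matrix(pitch_rad):
    """Rotation about the x-axis by pitch_rad.

    Assumption: this is the world-to-camera rotation R_wc for a camera
    that is only pitched (no yaw or roll).
    """
    c, s = np.cos(pitch_rad), np.sin(pitch_rad)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, c, -s],
                     [0.0, s, c]])

def root_orient_cam_to_world(root_orient_cam, pitch_rad):
    """Express a camera-frame root orientation in the world frame.

    root_orient_cam: axis-angle vector (3,) in camera coordinates.
    Returns a (3, 3) rotation matrix: R_world = R_wc^T @ R_cam.
    """
    R_wc = pitch_matrix(pitch_rad)
    return R_wc.T @ axis_angle_to_matrix(root_orient_cam)

def vertices_cam_to_world(vertices_cam, pitch_rad):
    """Rotate mesh vertices (N, 3) from camera to world coordinates.

    For row vectors, (R_wc^T v)^T = v^T R_wc, hence the right-multiply.
    """
    return np.asarray(vertices_cam, dtype=float) @ pitch_matrix(pitch_rad)
```

Under these assumptions, a zero pitch leaves the root orientation unchanged, which corresponds to the zero-camera-rotation simplification that the abstract identifies as a source of error when the true pitch is nonzero.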
