MUC: Mixture of Uncalibrated Cameras for Robust 3D Human Body Reconstruction

Yitao Zhu, Sheng Wang, Mengjie Xu, Zixu Zhuang, Zhixin Wang, Kaidong Wang, Han Zhang, Qian Wang

TL;DR

The paper addresses robust 3D human body reconstruction from multiple uncalibrated cameras by combining per-view single-view encodings with learned view-specific reweighting. It introduces a Joint Reweighting Network (JRN) to fuse joints across views and a Surface Reweighting Network (SRN) to fuse surface details, including facial expressions, coherently within the SMPL-X framework. The approach is end-to-end trainable with dedicated losses, and it reports state-of-the-art performance on two public datasets (Human3.6M and RICH) while supporting an arbitrary number of cameras. This calibration-free fusion method advances practical, expressive 3D human modeling for behavior analysis, VR/AR, and related applications; the code is publicly released.

Abstract

Multiple cameras can provide comprehensive multi-view video coverage of a person. Fusing this multi-view data is crucial for tasks like behavioral analysis, although it traditionally requires camera calibration, a process that is often complex. Moreover, previous studies have overlooked the challenges posed by self-occlusion under multiple views and the continuity of human body shape estimation. In this study, we introduce a method to reconstruct the 3D human body from multiple uncalibrated camera views. Initially, we utilize a pre-trained human body encoder to process each camera view individually, enabling the reconstruction of human body models and parameters for each view along with predicted camera positions. Rather than merely averaging the models across views, we develop a neural network trained to assign weights to individual views for all human body joints, based on the estimated distribution of joint distances from each camera. Additionally, we focus on the mesh surface of the human body for dynamic fusion, allowing for the seamless integration of facial expressions and body shape into a unified human body model. Our method has shown excellent performance in reconstructing the human body on two public datasets, advancing beyond previous work from the SMPL model to the SMPL-X model. This extension incorporates more complex hand poses and facial expressions, enhancing the detail and accuracy of the reconstructions. Crucially, it supports the flexible ad-hoc deployment of any number of cameras, offering significant potential for various applications. Our code is available at https://github.com/AbsterZhu/MUC.
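
The per-joint view weighting described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each view's joint estimates are already expressed in a shared, root-relative body frame, that a network (not shown) has produced one scalar score per view and joint, and that scores are turned into weights with a softmax over views. The names `softmax` and `fuse_joints` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_joints(per_view_joints, per_view_scores):
    """Fuse joint estimates from several views with per-joint weights.

    per_view_joints[v][j] is an (x, y, z) tuple for joint j seen from view v
    (assumed to be in a common root-relative frame).
    per_view_scores[v][j] is a hypothetical scalar score for that estimate.
    """
    num_views = len(per_view_joints)
    num_joints = len(per_view_joints[0])
    fused = []
    for j in range(num_joints):
        # Weights are normalized across views separately for every joint.
        weights = softmax([per_view_scores[v][j] for v in range(num_views)])
        fused.append(tuple(
            sum(weights[v] * per_view_joints[v][j][axis] for v in range(num_views))
            for axis in range(3)
        ))
    return fused
```

With equal scores this reduces to plain averaging across views, which is the baseline the abstract contrasts against; unequal scores let a view that sees a joint clearly dominate the fused estimate for that joint only.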

Paper Structure

This paper contains 13 sections, 1 equation, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Diagram of Mixture of Uncalibrated Cameras (MUC), with the architecture of the single-view encoder shown at the top. On the right, a notation table lists the symbols used in the human body model and their definitions.
  • Figure 2: Joint distance distribution loss involves several key steps. First, we subtract the minimal distance value and normalize the depth maps from the ground truth. Then, for the same joint, we align the predicted scores and distance distributions from different camera positions. This alignment enables the model to effectively mitigate the self-occlusion caused by greater distances from the camera position.
  • Figure 3: Workflow of the Surface Reweighting Network. (a) The mixed body and hand parameters, together with the shape and expression parameters, are transformed into continuous feature maps through UV projection. (b) The camera position is used as a condition for cross-attention with the feature map, yielding UV-map-level weight maps and PCA-reduced-level weight vectors.
  • Figure 4: We conducted a qualitative comparison with SMPLer-X across three datasets. The first two groups are from the RICH dataset, the third group is from the Human3.6M dataset, and the last group is from an additional validation dataset, the MARCOnI dataset (Elhayek et al., 2015), which serves as a more challenging test scenario (images were recorded using a handheld smartphone). "O" stands for the original image, "S" for the single-view reconstruction result by SMPLer-X, and "F" for the fusion result of our method. Zoom in for a better view.
  • Figure 5: Temporal comparison of PA-MPVPE across different camera setups over sequential frames. Mono-view reconstructions from cameras 0, 1, and 2 are depicted in varying colors, while the multi-view reconstruction is shown in red. Zoom in for a better view.
  • ...and 1 more figures
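
The steps named in the Figure 2 caption (subtract the minimal distance, normalize, then align predicted scores with the distance distribution for the same joint) can be sketched as follows. The caption does not give the exact normalization or the loss form, so this sketch makes its own assumptions: distances are rescaled to [0, 1] by their range, the target distribution is a softmax over negated normalized distances (nearer cameras get more weight), and the alignment is measured with a KL divergence. All function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of scalars."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def distance_target(distances):
    """Turn one joint's camera distances into a target distribution over views."""
    d_min = min(distances)
    shifted = [d - d_min for d in distances]       # subtract the minimal distance
    scale = max(shifted) or 1.0                    # assumed: range normalization
    normalized = [s / scale for s in shifted]
    return softmax([-n for n in normalized])       # assumed: nearer view, higher weight


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions."""
    return sum(t * math.log((t + eps) / (p + eps)) for t, p in zip(target, predicted))


def joint_distance_loss(pred_scores, distances):
    """Assumed loss: align softmaxed view scores with the distance-based target."""
    return kl_divergence(distance_target(distances), softmax(pred_scores))
```

For one joint observed at distances `[2.0, 3.0, 4.0]`, the target puts the most weight on the first (nearest) camera, and the loss vanishes when the predicted scores induce the same distribution.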