PSHuman: Photorealistic Single-image 3D Human Reconstruction using Cross-Scale Multiview Diffusion and Explicit Remeshing

Peng Li, Wangguandong Zheng, Yuan Liu, Tao Yu, Yangguang Li, Xingqun Qi, Xiaowei Chi, Siyu Xia, Yan-Pei Cao, Wei Xue, Wenhan Luo, Yike Guo

TL;DR

PSHuman introduces a two-stage framework for photorealistic 3D human reconstruction from a single image. It combines a body-face cross-scale diffusion model with SMPL-X conditioned guidance and an SMPLX-initialized explicit carving stage. The cross-scale diffusion generates consistent six-view full-body images and high-fidelity facial details, while the explicit carving stage enforces accurate geometry and texture via differentiable remeshing and texture fusion guided by multiview normals. Key contributions include the body-face diffusion with a noise-blending fusion layer, SMPL-X conditioned multiview diffusion to reduce self-occlusion artifacts, and a fast, texture-preserving mesh reconstruction pipeline that surpasses prior methods on THuman2.1 and CAPE. The approach reconstructs high-quality geometry and appearance in about one minute, which makes it practical for fast 3D human reconstruction in animation, AR/VR, and fashion applications, while acknowledging ethical considerations around potential misuse of realistic synthetic humans.

Abstract

Detailed and photorealistic 3D human modeling is essential for various applications and has seen tremendous progress. However, full-body reconstruction from a monocular RGB image remains challenging due to the ill-posed nature of the problem and sophisticated clothing topology with self-occlusions. In this paper, we propose PSHuman, a novel framework that explicitly reconstructs human meshes utilizing priors from a multiview diffusion model. We find that directly applying multiview diffusion to single-view human images leads to severe geometric distortions, especially on generated faces. To address this, we propose a cross-scale diffusion that models the joint probability distribution of global full-body shape and local facial characteristics, enabling detailed and identity-preserved novel-view generation without any geometric distortion. Moreover, to enhance cross-view body shape consistency across varied human poses, we condition the generative model on parametric models like SMPL-X, which provide body priors and prevent unnatural views inconsistent with human anatomy. Leveraging the generated multi-view normal and color images, we present SMPLX-initialized explicit human carving to recover realistic textured human meshes efficiently. Extensive experimental results and quantitative evaluations on the CAPE and THuman2.1 datasets demonstrate PSHuman's superiority in geometry details, texture fidelity, and generalization capability.

Paper Structure

This paper contains 15 sections, 9 equations, 21 figures, and 7 tables.

Figures (21)

  • Figure 1: We introduce PSHuman, a diffusion-based full-body human reconstruction model. Given a single image of a clothed person, our method recovers detailed geometry and realistic 3D human appearance across various poses within one minute.
  • Figure 2: Each triplet contains the input (left) and reconstructions without (middle) and with (right) the SMPL-X condition. Compared with naive diffusion, the SMPL-X prior helps handle self-occlusion and improves consistency.
  • Figure 3: (a) Overall pipeline. Given a single full-body human image, PSHuman recovers the textured human mesh in two stages: 1) Body-face enhanced and SMPL-X conditioned multiview generation. The input image and predicted SMPL-X are fed into a multiview image diffusion model to generate six views of global full-body images and front local face images. 2) SMPLX-initialized explicit human carving. The generated normal and color maps are used to deform and remesh the SMPL-X mesh with differentiable rasterization. (b) Illustration of the joint denoising diffusion block.
  • Figure 4: Illustration of our explicit human carving module.
  • Figure 5: Geometry comparison of PSHuman with implicit and explicit methods for 3D human inference from in-the-wild images. Existing methods often struggle with complex poses and loose clothing, leading to issues such as absent body parts, disrupted clothing, and a lack of fine details. In contrast, PSHuman provides a complete shape, detailed facial features, and natural-looking clothing folds. Following ECON [xiu2023econ], we substitute the hands with SMPL-X models to enhance visual quality.
  • ...and 16 more figures
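
The TL;DR mentions a noise-blending fusion layer that joins the body and face branches, but this page does not give its formula. The following is a minimal sketch of one way such a blend could work, not the authors' implementation. It assumes that each branch predicts a noise map of shape (C, H, W), that the face crop corresponds to a known rectangle in the body map, and that the two predictions are mixed linearly. The names `resize_nearest`, `blend_face_noise`, `face_box`, and `weight` are all hypothetical.

```python
import numpy as np


def resize_nearest(x: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Nearest-neighbour resize of a (C, H, W) array to (C, out_h, out_w)."""
    _, h, w = x.shape
    rows = np.arange(out_h) * h // out_h  # source row for each output row
    cols = np.arange(out_w) * w // out_w  # source column for each output column
    return x[:, rows[:, None], cols[None, :]]


def blend_face_noise(body_eps, face_eps, face_box, weight=0.5):
    """Mix the face-branch noise into the face region of the body-branch noise.

    face_box = (top, left, height, width) in body-map coordinates.
    This layout and the linear mixing rule are assumptions for illustration.
    """
    top, left, box_h, box_w = face_box
    blended = body_eps.copy()  # leave the caller's array untouched
    face_small = resize_nearest(face_eps, box_h, box_w)
    region = blended[:, top:top + box_h, left:left + box_w]
    blended[:, top:top + box_h, left:left + box_w] = (
        weight * face_small + (1.0 - weight) * region
    )
    return blended


# Toy inputs: zeros for the body branch, ones for the face branch.
body_eps = np.zeros((4, 32, 32))
face_eps = np.ones((4, 32, 32))
blended = blend_face_noise(body_eps, face_eps, face_box=(2, 12, 8, 8), weight=0.5)
```

In this sketch only the face rectangle changes, so the global body prediction stays intact elsewhere. A real system would apply such a step inside every denoising iteration and would likely use a smoother resampling method than nearest-neighbour.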