GeneMAN: Generalizable Single-Image 3D Human Reconstruction from Multi-Source Human Data

Wentao Wang, Hang Ye, Fangzhou Hong, Xue Yang, Jianfu Zhang, Yizhou Wang, Ziwei Liu, Liang Pan

TL;DR

GeneMAN targets robust single-image 3D human reconstruction from in-the-wild photos by building human-specific priors from a large, multi-source dataset and integrating them into a template-free geometry and texture refinement pipeline. It leverages a 2D diffusion prior for appearance and a 3D view-conditioned diffusion prior for geometry to guide NeRF→DMTet geometry initialization and multi-space texture synthesis, achieving strong generalization across clothing, poses, and personal belongings. The approach delivers state-of-the-art quantitative and qualitative results on challenging data and is validated by a large user study, though it incurs longer per-subject optimization times than feed-forward methods. Overall, GeneMAN advances practical single-view 3D human reconstruction by combining rich priors with an end-to-end geometry-plus-texture synthesis framework, enabling high-fidelity reconstructions in diverse real-world scenarios.

Abstract

Given a single in-the-wild human photo, it remains a challenging task to reconstruct a high-fidelity 3D human model. Existing methods face difficulties including a) the varying body proportions captured by in-the-wild human images; b) diverse personal belongings within the shot; and c) ambiguities in human postures and inconsistency in human textures. In addition, the scarcity of high-quality human data intensifies the challenge. To address these problems, we propose a Generalizable image-to-3D huMAN reconstruction framework, dubbed GeneMAN, building upon a comprehensive multi-source collection of high-quality human data, including 3D scans, multi-view videos, single photos, and our generated synthetic human data. GeneMAN encompasses three key modules. 1) Without relying on parametric human models (e.g., SMPL), GeneMAN first trains a human-specific text-to-image diffusion model and a view-conditioned diffusion model, which serve as the GeneMAN 2D human prior and 3D human prior for reconstruction, respectively. 2) With the help of the pretrained human prior models, the Geometry Initialization-&-Sculpting pipeline recovers high-quality 3D human geometry from a single image. 3) To achieve high-fidelity 3D human textures, GeneMAN employs the Multi-Space Texture Refinement pipeline, which refines textures first in the latent space and then in the pixel space. Extensive experimental results demonstrate that GeneMAN can generate high-quality 3D human models from a single input image, outperforming prior state-of-the-art methods. Notably, GeneMAN generalizes much better to in-the-wild images, often yielding high-quality 3D human models in natural poses with common items, regardless of the body proportions in the input images.

Paper Structure

This paper contains 27 sections, 9 equations, 25 figures, 6 tables.

Figures (25)

  • Figure 1: GeneMAN is a generalizable framework for single-view-to-3D human reconstruction, built on a collection of multi-source human data. Given a single in-the-wild image of a person, GeneMAN can reconstruct a high-quality 3D human model, regardless of the person's clothing, pose, or body proportions (e.g., a full-body, a half-body, or a close-up shot) in the given image. The anonymous project page of GeneMAN is: https://roooooz.github.io/GeneMAN/.
  • Figure 2: Overview of the Multi-Source Human Dataset and Our GeneMAN Pipeline. We have constructed a multi-source human dataset comprising 3D scans, videos, 2D images, and synthetic data. This dataset is utilized to train human-specific 2D and 3D prior models, which provide generalizable geometric and texture priors for our GeneMAN framework. Through geometry initialization, sculpting, and multi-space texture refinement in GeneMAN, we achieve high-fidelity 3D human body reconstruction from single in-the-wild images.
  • Figure 3: Geometry Initialization & Sculpting. During the geometry reconstruction stage, we initialize a template-free geometry using NeRF [mildenhall2021nerf], incorporating GeneMAN 2D and 3D priors with SDS losses. Alongside diffusion-based guidance, a reference loss ensures alignment with the input image. We then convert NeRF into DMTet [shen2021dmtet] for high-resolution refinement, guided by pretrained human-specific normal- and depth-adapted diffusion models [huang2024humannorm].
  • Figure 4: Multi-Space Texture Refinement. In the texture generation stage, we propose multi-space texture refinement to optimize texture in both latent space and pixel space. First, we generate the coarse textures using multi-view texturing, which are then iteratively refined in latent space. Subsequently, detailed textures are obtained by optimizing the UV map in pixel space with a 2D prior-based ControlNet.
  • Figure 5: Qualitative Results of GeneMAN with Complex Poses.
  • ...and 20 more figures
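
The Figure 3 caption describes geometry initialization as an optimization driven by SDS (score distillation sampling) losses from the 2D and 3D priors plus a reference loss against the input image. The following is a minimal sketch of how such a combined gradient could be assembled, not the authors' implementation. It uses the standard SDS gradient form, w(t)(eps_hat - eps), and replaces each diffusion prior with a toy closed-form noise predictor. Everything named here (`make_point_prior`, `sds_grad`, `combined_grad`, `optimize`, the weights `lam_2d`, `lam_3d`, `lam_ref`, and the choice w(t) = 1 - alpha_bar) is a hypothetical assumption for illustration. The sketch also optimizes a flat "image" directly, whereas a real pipeline would backpropagate through a renderer into NeRF parameters.

```python
import math
import random


def make_point_prior(mu):
    """Toy stand-in for a diffusion prior whose only 'data' is the image mu.

    For such a prior the ideal noise prediction has a closed form:
    eps_hat = (x_t - sqrt(alpha_bar) * mu) / sqrt(1 - alpha_bar).
    A real 2D or view-conditioned 3D prior would be a trained network.
    """
    def predict_noise(x_t, alpha_bar):
        a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
        return [(xt - a * m) / s for xt, m in zip(x_t, mu)]
    return predict_noise


def sds_grad(x, predict_noise, alpha_bar, rng):
    """Standard SDS gradient w.r.t. the rendered image x (flat list of floats)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x_t = [a * xi + s * ei for xi, ei in zip(x, eps)]  # forward diffusion
    eps_hat = predict_noise(x_t, alpha_bar)
    w = 1.0 - alpha_bar  # assumed weighting, one common choice
    return [w * (eh - e) for eh, e in zip(eps_hat, eps)]


def combined_grad(x, ref, prior_2d, prior_3d, alpha_bar, rng,
                  lam_2d=1.0, lam_3d=1.0, lam_ref=1.0, use_reference=True):
    """Weighted sum of two SDS gradients and an L2 reference-loss gradient."""
    g2 = sds_grad(x, prior_2d, alpha_bar, rng)
    g3 = sds_grad(x, prior_3d, alpha_bar, rng)
    g = [lam_2d * u + lam_3d * v for u, v in zip(g2, g3)]
    if use_reference:
        # Gradient of 0.5 * ||x - ref||^2, applied at the reference view.
        g = [gi + lam_ref * (xi - ri) for gi, xi, ri in zip(g, x, ref)]
    return g


def optimize(ref, steps=200, lr=0.1, seed=0):
    """Gradient descent on the image itself, starting from zeros."""
    rng = random.Random(seed)
    prior_2d = make_point_prior(ref)  # both toy priors agree with the input
    prior_3d = make_point_prior(ref)
    x = [0.0] * len(ref)
    for _ in range(steps):
        alpha_bar = rng.uniform(0.02, 0.98)  # random noise level per step
        g = combined_grad(x, ref, prior_2d, prior_3d, alpha_bar, rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


if __name__ == "__main__":
    print(optimize([1.0, 0.5, 0.25]))
```

With the toy point prior, the injected noise cancels analytically, so each SDS term reduces to a pull toward `mu`. With trained priors the gradient stays noisy, and the relative weights of the three terms become a tuning decision that this sketch does not address.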