MagicMan: Generative Novel View Synthesis of Humans with 3D-Aware Diffusion and Iterative Refinement

Xu He, Xiaoyu Li, Di Kang, Jiangnan Ye, Chaopeng Zhang, Liyang Chen, Xiangjun Gao, Han Zhang, Zhiyong Wu, Haolin Zhuang

TL;DR

MagicMan tackles the challenge of generating dense, 3D-consistent human views from a single image by fusing a pre-trained 2D diffusion prior with a 3D body prior via SMPL-X. It introduces a hybrid multi-view attention mechanism to exchange information across views efficiently, and a geometry-aware dual-branch setup to jointly synthesize RGB images and normals. An iterative refinement loop progressively improves SMPL-X pose accuracy and multi-view consistency, reducing ill-shaped artifacts caused by monocular pose estimation. Empirically, MagicMan achieves state-of-the-art performance in both novel-view synthesis and 3D human reconstruction, outperforming strong baselines on THuman2.1, CustomHumans, and in-the-wild data, with high-quality dense view generation suitable for downstream reconstruction and animation applications.

Abstract

Existing work on single-image human reconstruction suffers from weak generalizability due to insufficient training data, or from 3D inconsistencies due to a lack of comprehensive multi-view knowledge. In this paper, we introduce MagicMan, a human-specific multi-view diffusion model designed to generate high-quality novel view images from a single reference image. At its core, we leverage a pre-trained 2D diffusion model as the generative prior for generalizability, with the parametric SMPL-X model as the 3D body prior to promote 3D awareness. To tackle the critical challenge of maintaining consistency while achieving dense multi-view generation for improved 3D human reconstruction, we first introduce hybrid multi-view attention to facilitate both efficient and thorough information interchange across different views. Additionally, we present a geometry-aware dual branch to perform concurrent generation in both RGB and normal domains, further enhancing consistency via geometry cues. Finally, to address ill-shaped issues arising from inaccurate SMPL-X estimation that conflicts with the reference image, we propose a novel iterative refinement strategy, which progressively optimizes SMPL-X accuracy while enhancing the quality and consistency of the generated multi-view images. Extensive experimental results demonstrate that our method significantly outperforms existing approaches in both novel view synthesis and subsequent 3D human reconstruction tasks.
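
The iterative refinement idea in the abstract can be pictured as an alternating loop: generate views under the current body estimate, re-fit the body estimate to those views, and repeat. The following is a toy sketch of that control flow only, not the authors' implementation. The pose is reduced to a single number, and `iterative_refinement`, `toy_generate_views`, `toy_fit_pose`, `TRUE_POSE`, and the 50/50 blend are all hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical ground truth that the reference image "shows" (toy scalar pose).
TRUE_POSE = 1.0


def toy_generate_views(pose: float) -> List[float]:
    """Toy stand-in for multi-view generation.

    Assumption: each generated view is pulled partly toward the pose
    guidance and partly toward the evidence in the reference image.
    """
    blended = 0.5 * pose + 0.5 * TRUE_POSE
    return [blended] * 4  # four identical toy "views"


def toy_fit_pose(views: List[float]) -> float:
    """Toy stand-in for re-fitting the body model to the generated views."""
    return sum(views) / len(views)


def iterative_refinement(
    init_pose: float,
    generate_views: Callable[[float], List[float]],
    fit_pose: Callable[[List[float]], float],
    num_iters: int = 5,
) -> Tuple[float, List[float]]:
    """Alternate between updating the pose and regenerating the views."""
    pose = init_pose
    views = generate_views(pose)
    for _ in range(num_iters):
        pose = fit_pose(views)        # update the body estimate from current views
        views = generate_views(pose)  # regenerate views under the updated guidance
    return pose, views


if __name__ == "__main__":
    pose, views = iterative_refinement(0.0, toy_generate_views, toy_fit_pose)
    print("refined pose:", pose, "error:", abs(pose - TRUE_POSE))
```

In this toy setting the pose error halves at every iteration, which mirrors the qualitative claim that pose accuracy and view quality improve together. How the actual method couples the update to the diffusion sampling process is not specified here.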

Paper Structure

This paper contains 22 sections, 4 equations, 17 figures, 6 tables, 1 algorithm.

Figures (17)

  • Figure 1: Typical ill-shaped geometry caused by inaccurate SMPL-X poses in the recent 3D human reconstruction works ICON (Xiu et al., 2022) and ECON (Xiu et al., 2023): 1) abnormal overall poses, such as a tilted body (upper); 2) broken and distorted body parts (lower). Our proposed iterative refinement significantly improves the accuracy of SMPL-X poses, alleviating these ill-shaped problems.
  • Figure 2: Given a single human image, our proposed MagicMan utilizes a pre-trained 2D diffusion model with a 3D human body prior to generate novel view images for humans. First, the reference image is fed into the denoising UNet via a reference UNet, with the viewpoint condition incorporated through camera embeddings. The rendered normal and segmentation maps of the posed SMPL-X mesh that corresponds to the reference image are also provided as geometry guidance to facilitate 3D awareness and consistency. To obtain dense and consistent novel view images, we modify the attention module to a more efficient hybrid 1D-3D attention (a) to establish comprehensive connections between multi-views, and propose a geometry-aware dual branch (b) to also generate normal images complementary to RGB images via geometry cues. Last but not least, a novel iterative refinement strategy (c) is proposed in the inference stage to gradually update the initially estimated inaccurate SMPL-X pose and the synthesized novel view images, substantially reducing the ill-shaped issues arising from unreliable SMPL-X estimates.
  • Figure 3: Qualitative results of novel view synthesis. MagicMan generates the highest-quality dense novel view images of humans with better consistency. Methods marked with * are fine-tuned on the THuman2.1 dataset for a fair comparison. Please zoom in for details.
  • Figure 4: Reconstructed 3D human meshes. MagicMan produces the best geometry and textures in cases with challenging poses and loose outfits. Please zoom in for details.
  • Figure 5: Ablations on hybrid attention and dual branch. Our full model presents the best multi-view consistency.
  • ...and 12 more figures
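
The Figure 2 caption mentions a hybrid 1D-3D attention that connects views efficiently. One plausible reading, sketched below with NumPy, is that the 1D part lets each spatial token attend across all views at the same position, while the 3D part runs full joint attention over a sparse subset of views. This reading, the identity query/key/value projections, the residual ordering, and the function names are assumptions made for illustration, not details confirmed by the text above.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention over the second-to-last axis."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def attention_1d(x):
    """x: (V, N, C). Each token attends across views at the same position only."""
    xt = np.swapaxes(x, 0, 1)      # (N, V, C): one short sequence per position
    out = attention(xt, xt, xt)    # attention along the view axis
    return np.swapaxes(out, 0, 1)  # back to (V, N, C)


def attention_3d(x, view_ids):
    """Joint attention over all positions of a sparse subset of views."""
    _, n, c = x.shape
    sub = x[view_ids].reshape(1, len(view_ids) * n, c)  # flatten views and positions
    out = attention(sub, sub, sub).reshape(len(view_ids), n, c)
    y = x.copy()
    y[view_ids] = out              # views outside the subset pass through unchanged
    return y


def hybrid_attention(x, view_ids):
    """Assumed combination: cheap 1D pass on all views, 3D pass on a few."""
    x = x + attention_1d(x)                 # residual 1D attention over all views
    x = x + attention_3d(x, view_ids) - x * 0.0  # residual 3D attention on the subset
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(6, 4, 8))      # 6 views, 4 tokens, 8 channels
    print(hybrid_attention(feats, [0, 3]).shape)
```

Under these assumptions the 1D pass compares V tokens per position (N sequences of length V), while the 3D pass compares every token of the selected views with every other, so restricting it to a few views keeps the quadratic cost manageable.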