SyncHuman: Synchronizing 2D and 3D Generative Models for Single-view Human Reconstruction

Wenyue Chen, Peng Li, Wangguandong Zheng, Chengfeng Zhao, Mengfei Li, Yaolong Zhu, Zhiyang Dou, Ronggang Wang, Yuan Liu

TL;DR

SyncHuman tackles single-view 3D clothed human reconstruction by unifying a 2D multiview diffusion model with a 3D native diffusion model through 2D-3D synchronization attention and a Multiview Guided Decoder. The cross-space framework enables mutual refinement between 2D detailed textures and 3D structural fidelity, producing high-quality textured meshes even for challenging poses. Across extensive experiments, it surpasses prior single-view methods in geometry and appearance and can outperform some large-scale 3D generators trained on much larger datasets. This approach offers a robust, scalable direction for diffusion-based 3D human generation from single images, with broad implications for AR/VR and content creation.

Abstract

Photorealistic 3D full-body human reconstruction from a single image is a critical yet challenging task for applications in films and video games due to inherent ambiguities and severe self-occlusions. While recent approaches leverage SMPL estimation and SMPL-conditioned image generative models to hallucinate novel views, they suffer from inaccurate 3D priors estimated from SMPL meshes and struggle to handle challenging human poses and reconstruct fine details. In this paper, we propose SyncHuman, a novel framework that combines a 2D multiview generative model and a 3D native generative model for the first time, enabling high-quality clothed human mesh reconstruction from single-view images even under challenging human poses. The multiview generative model excels at capturing fine 2D details but struggles with structural consistency, whereas the 3D native generative model generates coarse yet structurally consistent 3D shapes. By integrating the complementary strengths of these two approaches, we develop a more effective generation framework. Specifically, we first jointly fine-tune the multiview generative model and the 3D native generative model with the proposed pixel-aligned 2D-3D synchronization attention to produce geometrically aligned 3D shapes and 2D multiview images. To further improve details, we introduce a feature injection mechanism that lifts fine details from the 2D multiview images onto the aligned 3D shapes, enabling accurate and high-fidelity reconstruction. Extensive experiments demonstrate that SyncHuman achieves robust and photorealistic 3D human reconstruction, even for images with challenging poses. Our method outperforms baseline methods in geometric accuracy and visual fidelity, demonstrating a promising direction for future 3D generation models.

Paper Structure

This paper contains 25 sections, 22 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: We introduce SyncHuman, a full-body human reconstruction model using synchronized 2D and 3D diffusion models. Given a single image of a clothed person, our method generates detailed geometry and lifelike 3D human appearances across diverse poses.
  • Figure 2: Geometric comparison between SMPL estimation patel2024camerahmr, the 2D multiview generative model (MVD) PSHuman li2024pshuman, the native 3D generative model Trellis xiang2024structured, and our method. The 2D MVD produces high-quality details but has geometry artifacts when conditioned on inaccurate SMPL meshes. The native 3D generative model produces correct coarse structure but loses fine details and fidelity. Our method combines the strengths of both 2D and 3D generative models to produce detailed 3D human meshes with high fidelity.
  • Figure 3: Overview. Given a single human image, SyncHuman first generates multiview color and normal maps, along with an aligned sparse voxel grid, which is further transformed into a set of structured latents. Then, we propose to inject the high-quality images into the 3D latents via a Multiview Guided Decoder and output the detailed high-fidelity textured human mesh.
  • Figure 4: 2D-3D synchronization attention. 2D-to-3D attention: each 3D voxel feature is orthogonally projected onto the front, back, left, and right view planes to retrieve the corresponding 2D features, which then refine the voxel feature through cross-attention. 3D-to-2D attention: each 2D multiview feature is projected into 3D space to attend to a column of voxel features, enhancing the 2D features. This mutual refinement ensures that the 2D generative model and the 3D generative model align with each other in a shared 3D space.
  • Figure 5: Geometry comparisons between ECON visrecon, Human3Diff xue2024human3diff, SIFU zhang2024sifu, PSHuman li2024pshuman, and ours. Our method reconstructs 3D shapes with complete body structure and rich details.
  • ...and 10 more figures
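
The two attention directions described in the Figure 4 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a dense voxel grid whose resolution equals the view resolution (the paper uses a sparse grid), an axis convention of x right, y up, z toward the front camera, a single attention head, and no learned query/key/value projections. All function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def project_to_views(coords, res):
    """Orthographically project integer voxel coords (N, 3) to (row, col) pixels.

    Assumed convention (not from the paper): x right, y up, z toward the
    front camera; every view image is res x res, matching the voxel grid.
    Returns an int array of shape (4, N, 2) ordered front, back, left, right.
    """
    x, y, z = coords[:, 0], coords[:, 1], coords[:, 2]
    row = res - 1 - y  # image rows grow downward while y grows upward
    front = np.stack([row, x], axis=-1)
    back = np.stack([row, res - 1 - x], axis=-1)
    left = np.stack([row, z], axis=-1)
    right = np.stack([row, res - 1 - z], axis=-1)
    return np.stack([front, back, left, right], axis=0)

def attend_2d_to_3d(voxel_feats, coords, view_feats):
    """2D-to-3D: each voxel queries its four projected pixel features.

    voxel_feats: (N, C), coords: (N, 3), view_feats: (4, res, res, C).
    """
    res = view_feats.shape[1]
    pix = project_to_views(coords, res)                      # (4, N, 2)
    views = np.arange(4)[:, None]                            # (4, 1)
    gathered = view_feats[views, pix[..., 0], pix[..., 1]]   # (4, N, C)
    kv = gathered.transpose(1, 0, 2)                         # (N, 4, C)
    scale = np.sqrt(voxel_feats.shape[1])
    scores = np.einsum("nc,nvc->nv", voxel_feats, kv) / scale
    weights = softmax(scores, axis=-1)                       # over the 4 views
    return voxel_feats + np.einsum("nv,nvc->nc", weights, kv)

def attend_3d_to_2d_front(front_feats, voxel_grid):
    """3D-to-2D for the front view: each pixel attends to its depth column.

    front_feats: (res, res, C); voxel_grid: (res, res, res, C) indexed [x, y, z].
    """
    # Reindex the grid to [row, col, z, C] with col = x and row = res - 1 - y.
    columns = voxel_grid[:, ::-1].transpose(1, 0, 2, 3)
    scale = np.sqrt(front_feats.shape[-1])
    scores = np.einsum("rcd,rczd->rcz", front_feats, columns) / scale
    weights = softmax(scores, axis=-1)                       # over depth z
    return front_feats + np.einsum("rcz,rczd->rcd", weights, columns)
```

In a sparse setting, empty voxels in each depth column would need to be masked out before the softmax, and the other three views would use their own axis permutations. Both are omitted here for brevity.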