MVD-HuGaS: Human Gaussians from a Single Image via 3D Human Multi-view Diffusion Prior

Kaiqiang Xiong, Ying Feng, Qi Zhang, Jianbo Jiao, Yang Zhao, Zhihao Liang, Huachen Gao, Ronggang Wang

TL;DR

MVD-HuGaS tackles single-view 3D clothed human reconstruction by using a human-aware multi-view diffusion prior to generate coherent multi-view imagery from a single image. It introduces SV3D-Human, fine-tuned on 2K2K, to inject 3D human priors; a Camera Alignment Module that jointly optimizes 3D Gaussians and camera poses to recover accurate extrinsics; and a depth-based Face Distortion Mitigation module that restores facial regions via multi-view 3DMM fitting and depth-guided warping. The final 3D Gaussians enable high-fidelity, real-time free-view rendering. The approach achieves state-of-the-art results on THuman2.0 and 2K2K in PSNR, SSIM, LPIPS, and CLIP-Sim, and ablation studies confirm the necessity of each component. This work provides a practical pipeline for robust single-image 3D human reconstruction with strong generalization to in-the-wild imagery, enabling realistic avatars for AR/VR, gaming, and film production.

Abstract

3D human reconstruction from a single image is a challenging problem that has been extensively studied in the literature. Recently, some methods have resorted to diffusion models for guidance, either optimizing a 3D representation via Score Distillation Sampling (SDS) or generating a single back-view image to facilitate reconstruction. However, these methods tend to produce artifacts (e.g., flattened human structure or over-smoothed results caused by inconsistent priors across views) and struggle to generalize to in-the-wild images. In this work, we present MVD-HuGaS, which enables free-view 3D human rendering from a single image via a multi-view human diffusion model. We first generate multi-view images from the single reference image with an enhanced multi-view diffusion model, fine-tuned on high-quality 3D human datasets to incorporate 3D geometry priors and human structure priors. To infer accurate camera poses from the sparse generated multi-view images for reconstruction, an alignment module is introduced to jointly optimize the 3D Gaussians and camera poses. Furthermore, we propose a depth-based Facial Distortion Mitigation module to refine the generated facial regions, thereby improving the overall fidelity of the reconstruction. Finally, leveraging the refined multi-view images along with their accurate camera poses, MVD-HuGaS optimizes the 3D Gaussians of the target human for high-fidelity free-view rendering. Extensive experiments on the THuman2.0 and 2K2K datasets show that MVD-HuGaS achieves state-of-the-art performance on single-view 3D human rendering.

Paper Structure

This paper contains 14 sections, 5 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Given a single image, the proposed MVD-HuGaS generates 3D Gaussians enabling high-fidelity free-view 3D clothed human Novel View Synthesis.
  • Figure 2: Camera Misalignment Issue. (a) 3D-GS rendering results from generated multi-view images with conditioned camera poses. (b) 3D-GS rendering results from multi-view images with the optimized camera poses via our Camera Alignment Module.
  • Figure 3: Facial Distortion Issue. Despite being fine-tuned on 3D human datasets, models still struggle to generate multi-view consistent and plausible human faces, even though they can learn the structural priors of the human body.
  • Figure 4: Framework of MVD-HuGaS. The framework consists of four modules: (a) SV3D-Human: Starting with an input reference image, our SV3D-Human model initiates the process by generating multi-view images. These images are produced with misaligned camera poses and facial distortions. (b) Camera Alignment Module: Our camera alignment module jointly optimizes the 3D Gaussians and the misaligned camera poses. This is achieved by minimizing the reconstruction loss between the rendered 3D Gaussians and the multi-view images. After optimization, we obtain camera poses that are precisely aligned with the generated multi-view images. (c) Face Distortion Mitigation: To mitigate the facial distortion in the generated source images, we first apply the aligned camera poses and conduct multi-view 3D Morphable Model (3DMM) fitting. This process yields a human face mesh with accurate shape and location in 3D space. The mesh is then rendered back to the reference view to derive the depth of the reference face. Using the reference RGB image and its corresponding depth map, the faces in the source views are then reconstructed through depth-based forward warping. (d) 3D-GS Reconstruction: Armed with the accurate camera poses from step (b), the refined multi-view images from step (c), and the input reference image, we reuse the 3D Gaussians to represent and reconstruct the target 3D human, enabling high-fidelity real-time free-view renderings.
  • Figure 5: Comparison of Depth Estimation Methods. MVS methods infer noisy depth maps due to the inherent inconsistency and distortion of the generated multi-view face regions. Vanilla 3DMM fitting methods can only estimate face geometry and relative depth from a single image, and they often fail to capture the absolute spatial position. In contrast, the proposed Multi-view 3DMM Fitting can accurately estimate the absolute depth of the face, including both facial geometry and 3D spatial information.
  • ...and 3 more figures
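
The depth-based forward warping mentioned in the Figure 4 caption, step (c), can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `forward_warp`, the shared pinhole intrinsics `K`, the 4x4 reference-to-source transform `T_ref_to_src`, and the nearest-pixel splatting with a z-buffer are all assumptions made for illustration. The idea is to lift each reference pixel to 3D using its depth, move it into the source camera frame, and project it back to the image plane.

```python
import numpy as np

def forward_warp(ref_rgb, ref_depth, K, T_ref_to_src):
    """Forward-warp a reference image into a source view (illustrative sketch).

    ref_rgb:      (H, W, 3) reference colors
    ref_depth:    (H, W) depth along the camera z-axis; values <= 0 are invalid
    K:            (3, 3) pinhole intrinsics, assumed identical for both views
    T_ref_to_src: (4, 4) rigid transform from reference to source camera frame
    Returns the warped (H, W, 3) image and an (H, W) boolean mask of filled pixels.
    """
    H, W = ref_depth.shape
    v, u = np.mgrid[0:H, 0:W]                    # pixel rows (v) and columns (u)
    valid = ref_depth > 0
    u, v, z = u[valid], v[valid], ref_depth[valid]
    colors = ref_rgb[valid]                      # (N, 3)

    # Unproject: X_ref = z * K^{-1} [u, v, 1]^T
    pix = np.stack([u, v, np.ones_like(u)], axis=0).astype(np.float64)
    pts_ref = (np.linalg.inv(K) @ pix) * z       # (3, N)

    # Move the points into the source camera frame.
    pts_src = T_ref_to_src[:3, :3] @ pts_ref + T_ref_to_src[:3, 3:4]

    # Keep points in front of the source camera, then project them.
    front = pts_src[2] > 1e-8
    pts_src, colors = pts_src[:, front], colors[front]
    proj = K @ pts_src
    us = np.rint(proj[0] / proj[2]).astype(int)  # nearest-pixel splatting
    vs = np.rint(proj[1] / proj[2]).astype(int)
    z_src = pts_src[2]

    inside = (us >= 0) & (us < W) & (vs >= 0) & (vs < H)
    us, vs, z_src, colors = us[inside], vs[inside], z_src[inside], colors[inside]

    # Z-buffer: for each target pixel keep only the nearest projected point.
    zbuf = np.full((H, W), np.inf)
    np.minimum.at(zbuf, (vs, us), z_src)
    nearest = z_src == zbuf[vs, us]

    warped = np.zeros_like(ref_rgb)
    warped[vs[nearest], us[nearest]] = colors[nearest]
    mask = np.isfinite(zbuf)
    return warped, mask

# Toy usage: a fronto-parallel plane at depth 2 and a small sideways camera shift.
rng = np.random.default_rng(0)
rgb = rng.random((4, 6, 3))
depth = np.full((4, 6), 2.0)
K = np.array([[100.0, 0.0, 3.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
T = np.eye(4)
T[0, 3] = 0.04   # image shift = fx * tx / depth = 100 * 0.04 / 2 = 2 pixels
warped, mask = forward_warp(rgb, depth, K, T)
```

In the toy usage the whole image shifts two pixels to the right and the two leftmost columns stay unfilled. Such holes are typical of forward warping, and how they are handled in the actual pipeline is not specified in this summary.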