MVD-HuGaS: Human Gaussians from a Single Image via 3D Human Multi-view Diffusion Prior
Kaiqiang Xiong, Ying Feng, Qi Zhang, Jianbo Jiao, Yang Zhao, Zhihao Liang, Huachen Gao, Ronggang Wang
TL;DR
MVD-HuGaS tackles single-view 3D clothed human reconstruction by using a human-aware multi-view diffusion prior to generate coherent multi-view images from a single image. It introduces SV3D-Human, fine-tuned on 2K2K, to inject 3D human priors; a Camera Alignment Module that jointly optimizes 3D Gaussians and camera poses to recover accurate extrinsics; and a depth-based Facial Distortion Mitigation module that restores facial regions via multi-view 3DMM fitting and depth-guided warping. The optimized 3D Gaussians then enable high-fidelity, real-time free-view rendering. The approach achieves state-of-the-art results on THuman2.0 and 2K2K in PSNR, SSIM, LPIPS, and CLIP-Sim, and ablation studies support the contribution of each component. This work provides a practical pipeline for single-image 3D human reconstruction that generalizes to in-the-wild imagery, with applications to realistic avatars for AR/VR, gaming, and film production.
Abstract
3D human reconstruction from a single image is a challenging problem that has been extensively studied in the literature. Recently, some methods have resorted to diffusion models for guidance, either optimizing a 3D representation via Score Distillation Sampling (SDS) or generating a single back-view image to facilitate reconstruction. However, these methods tend to produce artifacts (e.g., flattened human structure or over-smoothed results caused by inconsistent priors across views) and generalize poorly to in-the-wild images. In this work, we present MVD-HuGaS, which enables free-view 3D human rendering from a single image via a multi-view human diffusion model. We first generate multi-view images from the single reference image with an enhanced multi-view diffusion model, fine-tuned on high-quality 3D human datasets to incorporate 3D geometry priors and human structure priors. To infer accurate camera poses from the sparse generated multi-view images, we introduce an alignment module that jointly optimizes the 3D Gaussians and the camera poses. Furthermore, we propose a depth-based Facial Distortion Mitigation module to refine the generated facial regions, thereby improving the overall fidelity of the reconstruction. Finally, leveraging the refined multi-view images and their accurate camera poses, MVD-HuGaS optimizes the 3D Gaussians of the target human for high-fidelity free-view rendering. Extensive experiments on the THuman2.0 and 2K2K datasets show that MVD-HuGaS achieves state-of-the-art performance on single-view 3D human rendering.
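For readers unfamiliar with the rendering metrics named in the TL;DR, the following is a minimal sketch of PSNR, computed between a rendered view and its ground-truth image. It uses the standard definition, 10 * log10(MAX^2 / MSE). The function name `psnr`, the `max_value` parameter, and the assumption that pixel values lie in [0, 1] are illustrative choices, not details taken from the paper's evaluation code.

```python
import numpy as np


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images.

    Assumes pixel values lie in [0, max_value]; this range is an
    illustrative assumption, not a setting taken from the paper.
    """
    rendered = np.asarray(rendered, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    if rendered.shape != reference.shape:
        raise ValueError("images must have the same shape")

    # Mean squared error over all pixels and channels.
    mse = np.mean((rendered - reference) ** 2)
    if mse == 0.0:
        # Identical images: PSNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Higher values indicate a rendering closer to the reference. SSIM, LPIPS, and CLIP-Sim capture structural, perceptual, and semantic similarity respectively and would need their own implementations.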
