HumanDreamer-X: Photorealistic Single-image Human Avatars Reconstruction via Gaussian Restoration

Boyuan Wang, Runqi Ouyang, Xiaofeng Wang, Zheng Zhu, Guosheng Zhao, Chaojun Ni, Xiaopei Zhang, Guan Huang, Yijie Ren, Lihong Liu, Xingang Wang

TL;DR

HumanDreamer-X tackles single-image 3D human avatar reconstruction by combining 3D Gaussian Splatting (3DGS) with a video-diffusion-based restoration stage, HumanFixer. The approach initializes a coarse avatar from a single image, renders multi-view videos from it as geometric and appearance priors, and then restores these renderings to improve photorealism and geometric fidelity, using an attention modulation strategy in temporal self-attention to keep views consistent. It reports PSNR improvements of 16.45% in generation and 12.65% in reconstruction, reaching up to 25.62 dB, and shows generalization to in-the-wild data and applicability to various human reconstruction backbones. The unified pipeline reduces view-inconsistency artifacts such as fragmented or blurred limbs and yields avatars suitable for subsequent animation.

Abstract

Single-image human reconstruction is vital for digital human modeling applications but remains an extremely challenging task. Current approaches rely on generative models to synthesize multi-view images for subsequent 3D reconstruction and animation. However, directly generating multiple views from a single human image suffers from geometric inconsistencies, resulting in issues like fragmented or blurred limbs in the reconstructed models. To tackle these limitations, we introduce HumanDreamer-X, a novel framework that integrates multi-view human generation and reconstruction into a unified pipeline, which significantly enhances the geometric consistency and visual fidelity of the reconstructed 3D models. In this framework, 3D Gaussian Splatting serves as an explicit 3D representation that provides initial geometry and appearance priors. Building upon this foundation, HumanFixer is trained to restore 3DGS renderings, which ensures photorealistic results. Furthermore, we examine the inherent challenges associated with attention mechanisms in multi-view human generation and propose an attention modulation strategy that effectively enhances geometric detail and identity consistency across views. Experimental results demonstrate that our approach improves the PSNR of generation and reconstruction by 16.45% and 12.65%, respectively, achieving a PSNR of up to 25.62 dB, while also generalizing to in-the-wild data and remaining applicable to various human reconstruction backbone models.
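Since the headline results are stated in PSNR, a minimal sketch of how that metric is usually computed may help in reading the numbers. The function names, the `max_value` parameter, and the toy images below are illustrative assumptions, not taken from the paper. The sketch also assumes the reported percentages are relative gains in dB over a baseline, which the abstract does not state explicitly.

```python
import numpy as np


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape.

    max_value is the largest possible pixel value (1.0 for images scaled
    to [0, 1], 255 for 8-bit images).
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


def relative_gain_percent(baseline_db, improved_db):
    """Relative improvement of one PSNR value over another, in percent.

    Assumption: this is one plausible reading of a "PSNR improved by X%"
    claim; the paper's exact definition is not given in the abstract.
    """
    return 100.0 * (improved_db - baseline_db) / baseline_db


# Toy example: a black image versus one with a uniform error of 0.1.
ground_truth = np.zeros((4, 4, 3))
rendering = np.full((4, 4, 3), 0.1)
toy_psnr = psnr(ground_truth, rendering)
```

Because PSNR is logarithmic in the mean squared error, a gain of a few dB corresponds to a large reduction in pixel error, which is worth keeping in mind when interpreting percentage improvements.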

Paper Structure

This paper contains 14 sections, 5 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Illustration of HumanDreamer-X. The pipeline begins with a single-image reconstruction that produces a coarse 3D avatar, providing essential geometric and appearance priors for the restoration process. This yields a higher-quality 3D avatar suitable for subsequent animation tasks.
  • Figure 2: Overall framework of the proposed HumanDreamer-X. The process begins by initializing a coarse 3DGS avatar from a reference image. A video rendered from this avatar provides geometric and appearance priors. HumanFixer then performs video restoration, in which attention modulation is employed to enhance video consistency. Throughout this process, the restored video is used to continuously update the 3DGS model, ultimately resulting in a refined 3DGS avatar.
  • Figure 3: The creation of the dataset for training HumanFixer. First, we use Blender to render scans and obtain the ground truth video. Next, we employ the frontal image and its corresponding SMPL prior to reconstruct a coarse 3DGS model, followed by rendering multi-view videos. This process yields paired video data for training.
  • Figure 4: Attention weights visualization. The left and right sides show the attention weights of head 0 at the temporal self-attention stage when training on cyclic videos without and with attention modulation, respectively. Brighter colors indicate higher weights.
  • Figure 5: Comparison of generation with SOTA methods. Note that PSHuman's training dataset includes the entire CustomHumans dataset. Best viewed with zoom-in.
  • ...and 1 more figures
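
The control flow described in the Figure 2 caption (initialize, render, restore, update, repeat) can be sketched roughly as follows. This is only a structural illustration under stated assumptions: the four callables, the `num_rounds` parameter, and the scalar toy stand-ins are all hypothetical and do not correspond to the paper's actual models, code, or schedule.

```python
from typing import Any, Callable


def refine_avatar(
    reference_image: Any,
    init_avatar: Callable[[Any], Any],
    render_video: Callable[[Any], Any],
    restore_video: Callable[[Any, Any], Any],
    update_avatar: Callable[[Any, Any], Any],
    num_rounds: int = 3,  # assumption: the caption only says "continuously"
) -> Any:
    """Alternate between rendering, restoring, and updating an avatar."""
    # Coarse avatar from a single reference image.
    avatar = init_avatar(reference_image)
    for _ in range(num_rounds):
        # The rendered multi-view video carries geometry and appearance priors.
        coarse_video = render_video(avatar)
        # A restoration model (HumanFixer in the paper) cleans up the rendering.
        restored_video = restore_video(coarse_video, reference_image)
        # The restored frames supervise an update of the 3D representation.
        avatar = update_avatar(avatar, restored_video)
    return avatar


# Toy stand-ins: the "avatar" is one number and a "video" is a list of numbers.
toy_result = refine_avatar(
    reference_image=1.0,
    init_avatar=lambda image: 0.0,
    render_video=lambda avatar: [avatar] * 4,
    restore_video=lambda video, image: [f + 0.5 * (image - f) for f in video],
    update_avatar=lambda avatar, video: sum(video) / len(video),
)
```

Passing the stages in as callables keeps the sketch independent of any particular 3DGS backbone, which mirrors the abstract's claim that the approach applies to various reconstruction backbones.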