GaussianIP: Identity-Preserving Realistic 3D Human Generation via Human-Centric Diffusion Prior

Zichen Tang, Yuan Yao, Miaomiao Cui, Liefeng Bo, Hongyu Yang

TL;DR

This work addresses the challenge of generating identity-preserving, realistic 3D human avatars from text and image prompts. It introduces GaussianIP, a two-stage framework that couples 3D Gaussian Splatting with a human-centric diffusion prior. The first stage employs Adaptive Human Distillation Sampling (AHDS) to efficiently distill identity-relevant cues, while the second stage uses View-Consistent Refinement (VCR) to enhance facial and garment details with cross-view texture coherence. Empirical results show improved visual quality and faster training compared to state-of-the-art baselines, highlighting the practical impact for AR/VR and personalized digital humans; limitations suggest extending the approach to more complex poses and interactions in future work.

Abstract

Text-guided 3D human generation has advanced with the development of efficient 3D representations and 2D-lifting methods like Score Distillation Sampling (SDS). However, current methods suffer from prolonged training times and often produce results that lack fine facial and garment details. In this paper, we propose GaussianIP, an effective two-stage framework for generating identity-preserving realistic 3D humans from text and image prompts. Our core insight is to leverage human-centric knowledge to facilitate the generation process. In stage 1, we propose a novel Adaptive Human Distillation Sampling (AHDS) method to rapidly generate a 3D human that maintains high identity consistency with the image prompt and achieves a realistic appearance. Compared to traditional SDS methods, AHDS better aligns with the human-centric generation process, enhancing visual quality with notably fewer training steps. To further improve the visual quality of the face and clothing regions, we design a View-Consistent Refinement (VCR) strategy in stage 2. Specifically, it iteratively produces detail-enhanced versions of the multi-view images from stage 1, ensuring 3D texture consistency across views via mutual attention and distance-guided attention fusion. A polished version of the 3D human is then obtained by directly performing reconstruction with the refined images. Extensive experiments demonstrate that GaussianIP outperforms existing methods in both visual quality and training efficiency, particularly in generating identity-preserving results. Our code is available at: https://github.com/silence-tang/GaussianIP.

Paper Structure

This paper contains 11 sections, 11 equations, 5 figures, and 1 table.

Figures (5)

  • Figure 1: Overview of the GaussianIP framework. We combine 3D Gaussian Splatting (3DGS) with a human-centric diffusion prior to realize high-fidelity 3D human avatar generation. (a) We initialize 3D human Gaussians by densely sampling from a SMPL-X mesh. Afterward, (b) a human-centric diffusion model is combined with a pose-guided ControlNet to produce AHDS guidance. The AHDS guidance consists of an HDS guidance, which is proposed to achieve better identity-preserving generation, and an Adaptive Human-specific Timestep Scheduling strategy, which accelerates the HDS training. Furthermore, we propose (c) a View-Consistent Refinement Mechanism to further enhance the delicate texture of faces and garments. We guide the denoising of key views $\boldsymbol{x}_0^P$ with attention features from main views $\boldsymbol{x}_0^M$ through Mutual Attention. Next, we align the denoising of an intermediate view $\boldsymbol{x}_0^I$ with that of its neighbor key views via distance-guided attention fusion. Finally, the refined multi-view images are leveraged to optimize the current 3DGS.
  • Figure 2: Illustration of the optimized weight PDF for sampling timesteps and the corresponding timestep vs. training step (t-i) curve. (a) Phases 1 and 3 occupy the majority of the training steps, while Phase 2 occupies only a small portion, allowing a quick transition to the detailed texture learning in Phase 3. (b) We sample the final timestep between the lower bound and $t_{\text{DG}}$ for each phase. Note that for the geometry phase ($i<500$), we sample between 500 and the maximum timestep to ensure a smooth start.
  • Figure 3: Qualitative comparison results with SOTA text-guided 3D human generation models. Please zoom in for better observation. Note that the baselines cannot handle image prompts, so we compare their text-to-3D results instead. Due to space limitations, please refer to the supplementary materials for the video comparison results.
  • Figure 4: Ablation studies on various module designs. We present the generation results of the human frontal view under four ablation settings: (a) baseline; (b) + HDS; (c) + AHDS; (d) + View-consistent Refinement Mechanism. Detailed ablation settings and result analysis are depicted in Sec. 4.3.
  • Figure 5: Ablation study on our VCR module. When the multi-view images are denoised independently, the results lose cross-view texture consistency. In contrast, images refined with our VCR module maintain high 3D texture consistency.
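
The distance-guided attention fusion named in the Figure 1 caption can be illustrated with a minimal sketch. The caption does not state the weighting rule, so the linear, azimuth-based weights below are an assumption, and the names `fusion_weights` and `fuse_features` are hypothetical. The authors' implementation may differ.

```python
def fusion_weights(theta, theta_left, theta_right):
    """Return (w_left, w_right) for an intermediate view at azimuth `theta`
    lying between two key views at `theta_left` and `theta_right` (degrees).

    Assumption: weights vary linearly with angular distance, so the nearer
    key view receives the larger weight and the two weights sum to 1.
    """
    span = theta_right - theta_left
    if span <= 0:
        raise ValueError("theta_right must be greater than theta_left")
    if not theta_left <= theta <= theta_right:
        raise ValueError("theta must lie between the two key views")
    w_right = (theta - theta_left) / span
    return 1.0 - w_right, w_right


def fuse_features(feat_left, feat_right, w_left, w_right):
    """Blend two equally sized feature vectors element by element.

    Plain lists of floats stand in for the attention outputs that would be
    computed against each neighbor key view.
    """
    if len(feat_left) != len(feat_right):
        raise ValueError("feature vectors must have the same length")
    return [w_left * a + w_right * b for a, b in zip(feat_left, feat_right)]


if __name__ == "__main__":
    # An intermediate view at 30 degrees between key views at 0 and 90 degrees.
    w_l, w_r = fusion_weights(30.0, 0.0, 90.0)
    print(fuse_features([1.0, 0.0], [0.0, 1.0], w_l, w_r))
```

In this sketch an intermediate view that coincides with a key view takes its features entirely from that key view, which is one simple way for texture to vary smoothly between neighboring key views.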