Portrait3D: Text-Guided High-Quality 3D Portrait Generation Using Pyramid Representation and GANs Prior

Yiqian Wu, Hao Xu, Xiangjun Tang, Xien Chen, Siyu Tang, Zhebin Zhang, Chen Li, Xiaogang Jin

TL;DR

Portrait3D tackles text-guided 3D portrait generation by introducing a joint geometry-appearance prior, built from a pyramid tri-grid 3D representation and a 3D-aware GAN, 3DPortraitGAN-Pyramid. This prior initializes a diffusion-based text-to-3D pipeline: score distillation sampling first transfers the diffusion model's knowledge into the pyramid tri-grid, then the diffusion model refines rendered views, and the refined views are used to further optimize the grid. The approach mitigates grid-like artifacts and the Janus problem while producing canonical, high-quality, view-consistent 3D portraits that align with the prompt. Compared to state-of-the-art baselines, Portrait3D demonstrates superior qualitative realism, stronger quantitative scores (FID and CLIP), and robust handling of diverse appearance attributes, while remaining practical on consumer GPU hardware.

Abstract

Existing neural rendering-based text-to-3D-portrait generation methods typically make use of human geometry prior and diffusion models to obtain guidance. However, relying solely on geometry information introduces issues such as the Janus problem, over-saturation, and over-smoothing. We present Portrait3D, a novel neural rendering-based framework with a novel joint geometry-appearance prior to achieve text-to-3D-portrait generation that overcomes the aforementioned issues. To accomplish this, we train a 3D portrait generator, 3DPortraitGAN-Pyramid, as a robust prior. This generator is capable of producing 360° canonical 3D portraits, serving as a starting point for the subsequent diffusion-based generation process. To mitigate the "grid-like" artifact caused by the high-frequency information in the feature-map-based 3D representation commonly used by most 3D-aware GANs, we integrate a novel pyramid tri-grid 3D representation into 3DPortraitGAN-Pyramid. To generate 3D portraits from text, we first project a randomly generated image aligned with the given prompt into the pre-trained 3DPortraitGAN-Pyramid's latent space. The resulting latent code is then used to synthesize a pyramid tri-grid. Beginning with the obtained pyramid tri-grid, we use score distillation sampling to distill the diffusion model's knowledge into the pyramid tri-grid. Following that, we utilize the diffusion model to refine the rendered images of the 3D portrait and then use these refined images as training data to further optimize the pyramid tri-grid, effectively eliminating issues with unrealistic color and unnatural artifacts. Our experimental results show that Portrait3D can produce realistic, high-quality, and canonical 3D portraits that align with the prompt.
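
The score distillation sampling step mentioned in the abstract can be written out as a short formula. The block below is the standard form from DreamFusion, not an equation quoted from this paper; reading θ as the pyramid tri-grid parameters and g as the neural renderer is an assumption made here for illustration, and the paper's own notation may differ.

```latex
% Standard SDS gradient (DreamFusion notation; the mapping to Portrait3D is assumed)
% \theta           : parameters being optimized (assumed: the pyramid tri-grid)
% x = g(\theta)    : image rendered from a sampled camera
% y                : text prompt,  t : diffusion timestep
% \epsilon         : Gaussian noise, \epsilon \sim \mathcal{N}(0, I)
% \hat{\epsilon}_\phi : noise predicted by the frozen diffusion model
x_t = \alpha_t \, x + \sigma_t \, \epsilon,
\qquad
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right]
```

Because the gradient skips the diffusion model's Jacobian, only the renderer and the 3D representation are differentiated, which is what lets the prompt-conditioned diffusion prior be "distilled" into the 3D parameters.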

Paper Structure (21 sections, 4 equations, 9 figures, 1 table)

This paper contains 21 sections, 4 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: The results of score distillation sampling for 3D content generation (top) and texture generation (bottom), using positional encoding with different numbers of frequencies (a, b) and multi-resolution hash encoding (c). The same prompt, "a hamburger", was used for a fair comparison.
  • Figure 2: The architecture of the 3D-aware pyramid tri-grid generator in 3DPortraitGAN. The pyramid tri-grid is composed of tri-grids generated at different layers. For the sake of simplicity and clarity, we omit the latent code modulation applied to each block.
  • Figure 3: The 3D portrait generation pipeline of Portrait3D. The "" denotes that the submodule or representation is frozen.
  • Figure 4: Qualitative comparison to SOTA text-to-3D approaches: DreamFusion [DBLP:conf/iclr/PooleJBM23], LucidDreamer [DBLP:journals/corr/abs-2311-11284], TADA [DBLP:journals/corr/abs-2308-10899], AvatarCraft [Jiang_2023_ICCV], AvatarStudio [DBLP:journals/corr/abs-2311-17917], HumanGaussian [DBLP:journals/corr/abs-2311-17061], AvatarVerse [DBLP:journals/corr/abs-2308-03610], HumanNorm [DBLP:journals/corr/abs-2310-01406], SEEAvatar [xu2023seeavatar], TECA [zhang2023textguided], and our method. The input prompt is presented at the top.
  • Figure 5: The pyramid tri-grid is crucial for alleviating the "grid-like" artifacts. We showcase renderings of results obtained utilizing the two 3D representations (w/ and w/o optimization), accompanied by shapes extracted using Marching Cubes.
  • ...and 4 more figures
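
The captions for Figures 2 and 5 describe the pyramid tri-grid as a stack of tri-grids produced at different generator layers. As a rough illustration of the idea of a multi-resolution feature lookup, the sketch below samples one scalar channel from a pyramid of 2D feature planes and sums the per-level results. It is heavily simplified and not the authors' implementation: reducing each tri-grid to a single 2D plane, the summation across levels, and the names `bilinear` and `sample_pyramid` are all assumptions made for this example.

```python
from typing import List

Grid = List[List[float]]  # one scalar feature channel, indexed [row][col]


def bilinear(grid: Grid, u: float, v: float) -> float:
    """Sample `grid` at normalized coordinates (u, v) in [0, 1] bilinearly."""
    h, w = len(grid), len(grid[0])
    # Clamp to the valid range, then map to continuous pixel coordinates.
    x = min(max(u, 0.0), 1.0) * (w - 1)
    y = min(max(v, 0.0), 1.0) * (h - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
    bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def sample_pyramid(levels: List[Grid], u: float, v: float) -> float:
    """Hypothetical pyramid lookup: sum the features sampled at every level.

    Summation is an assumption; the paper may combine levels differently.
    """
    return sum(bilinear(level, u, v) for level in levels)


# A coarse 2x2 level carries the low-frequency content, while a finer 3x3
# level adds detail on top of it.
coarse = [[0.0, 1.0], [2.0, 3.0]]
fine = [[0.0, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.0]]
feature = sample_pyramid([coarse, fine], 0.5, 0.5)
```

In a layout like this, the coarse levels provide a smooth base signal and the fine levels only contribute detail, which is one plausible reading of why a pyramid helps with the "grid-like" artifacts described in Figure 5.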