Dual Diffusion Models for Multi-modal Guided 3D Avatar Generation

Hong Li, Yutang Feng, Minqi Meng, Yichen Yang, Xuhui Liu, Baochang Zhang

TL;DR

By learning the direct mapping from multi-modal prompts to 3D representations, PromptAvatar eliminates the need for time-consuming iterative optimization, successfully generating high-fidelity, shading-free 3D avatars in under 10 seconds.

Abstract

Generating high-fidelity 3D avatars from text or image prompts is highly sought after in virtual reality and human-computer interaction. However, existing text-driven methods often rely on iterative Score Distillation Sampling (SDS) or CLIP optimization, which struggle with fine-grained semantic control and suffer from excessively slow inference. Meanwhile, image-driven approaches are severely bottlenecked by the scarcity and high acquisition cost of high-quality 3D facial scans, limiting model generalization. To address these challenges, we first construct a novel, large-scale dataset comprising over 100,000 pairs across four modalities: fine-grained textual descriptions, in-the-wild face images, high-quality light-normalized texture UV maps, and 3D geometric shapes. Leveraging this comprehensive dataset, we propose PromptAvatar, a framework featuring dual diffusion models. Specifically, it integrates a Texture Diffusion Model (TDM) that supports flexible multi-condition guidance from text and/or image prompts, alongside a Geometry Diffusion Model (GDM) guided by text prompts. By learning the direct mapping from multi-modal prompts to 3D representations, PromptAvatar eliminates the need for time-consuming iterative optimization, successfully generating high-fidelity, shading-free 3D avatars in under 10 seconds. Extensive quantitative and qualitative experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches in generation quality, fine-grained detail alignment, and computational efficiency.

Paper Structure (20 sections, 5 equations, 10 figures, 7 tables)

Figures (10)

  • Figure 1: PromptAvatar generates realistic and animatable 3D avatars from a single text prompt, image, or both, compatible with 3D rendering engines like Blender. The top-left corner uses a text prompt to create an accurate texture UV-map and mesh. The bottom-left corner combines text with a FLUX-dev-1.0-generated image [team2025zimage] to guide high-quality texture UV-map creation. When an image prompt is used, facial geometry is extracted via a pre-trained 3D face reconstruction network [bai2022ffhq, deep3d2020]. On the right, image prompts enable detailed texture effects like crow’s feet and beards.
  • Figure 2: Our dataset creation pipeline consists of three main modules: De-lighting and Re-lighting face image generation, Incomplete UV-map correction and completion, and Identity coefficients estimation and facial attribute description.
  • Figure 3: Architecture of PromptAvatar. The framework comprises a Texture Diffusion Model (TDM), which targets high-quality normalized texture generation, and a Geometry Diffusion Model (GDM) for geometric identity coefficients. Both models are guided by multi-modal prompts embedded via CLIP. For image prompts, incomplete textures are encoded into the latent space to provide localized guidance.
  • Figure 4: The network architecture of GDM.
  • Figure 5: Visual comparison with DreamFace and Describe3D. Our results demonstrate superior alignment with fine-grained text prompts (highlighted in red). For instance, in the first row, observe the distribution of facial hair, eyebrow shape, and chin, and in the second row, note the facial shape, eye bags, and other features.
  • ...and 5 more figures