DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance

Longwen Zhang, Qiwei Qiu, Hongyang Lin, Qixuan Zhang, Cheng Shi, Wei Yang, Ye Shi, Sibei Yang, Lan Xu, Jingyi Yu

TL;DR

DreamFace tackles the challenge of text-guided creation of animatable, physically-based 3D facial assets that integrate smoothly into existing CG pipelines. It introduces a progressive three-module pipeline: geometry generation (coarse-to-fine, CLIP-guided selection, SDS refinement), texture diffusion (dual-path LDM with latent and image-space SDS, domain-aware prompt tuning, and UV-space texture fidelity), and animatability empowerment (cross-identity hypernetwork plus video-driven expression encoder). The approach yields detailed geometric and texture representations with high rendering fidelity and supports personalized animation from video, enabling broad applications in digital humans for media, gaming, and Metaverse contexts. Extensive experiments, ablations, and user studies demonstrate the effectiveness of the texture LDM, detail carving, and animation components, along with a discussion of limitations and ethical considerations.

Abstract

Emerging Metaverse applications demand accessible, accurate, and easy-to-use tools for 3D digital human creation in order to depict different cultures and societies as if in the physical world. Recent large-scale vision-language advances pave the way for novices to conveniently customize 3D content. However, the generated CG-friendly assets still cannot represent the desired facial traits for human characteristics. In this paper, we present DreamFace, a progressive scheme to generate personalized 3D faces under text guidance. It enables layman users to naturally customize 3D facial assets that are compatible with CG pipelines, with desired shapes, textures, and fine-grained animation capabilities. From a text input describing the facial traits, we first introduce a coarse-to-fine scheme to generate the neutral facial geometry with a unified topology. We employ a selection strategy in the CLIP embedding space, and subsequently optimize both the detail displacements and normals using Score Distillation Sampling (SDS) from a generic Latent Diffusion Model (LDM). Then, for neutral appearance generation, we introduce a dual-path mechanism, which combines the generic LDM with a novel texture LDM to ensure both the diversity and textural specification in the UV space. We also employ a two-stage optimization to perform SDS in both the latent and image spaces, which provides compact priors for fine-grained synthesis. Our generated neutral assets naturally support blendshape-based facial animations. We further improve the animation ability with personalized deformation characteristics by learning the universal expression prior using the cross-identity hypernetwork. Notably, DreamFace can generate realistic 3D facial assets with physically-based rendering quality and rich animation ability from video footage, even for fashion icons or exotic characters in cartoons and fiction movies.
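
The selection step in the CLIP embedding space can be illustrated with a minimal sketch, assuming that each coarse geometry candidate has already been rendered and encoded into an embedding vector, and that the text prompt has been encoded the same way. The function name `select_candidates`, the use of cosine similarity as the matching score, and the toy 3-dimensional vectors are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def select_candidates(text_emb, cand_embs, k=1):
    """Return the indices of the k best-matching candidates and all scores.

    text_emb:  (d,) embedding of the text prompt (assumed precomputed).
    cand_embs: (n, d) embeddings of rendered candidates (assumed precomputed).
    """
    # Normalize so that the dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    scores = c @ t
    # Sort by descending score and keep the top k indices.
    top = np.argsort(-scores)[:k]
    return top, scores

# Toy vectors standing in for real CLIP embeddings (hypothetical values).
text_emb = np.array([1.0, 0.0, 0.0])
cand_embs = np.array([
    [0.0, 1.0, 0.0],   # orthogonal to the prompt
    [2.0, 0.1, 0.0],   # nearly aligned with the prompt
    [-1.0, 0.0, 0.0],  # opposite to the prompt
])
top, scores = select_candidates(text_emb, cand_embs, k=1)
```

In this toy setup the second candidate is selected because its direction is closest to the prompt embedding; with real embeddings the same ranking logic would apply to whatever matching score the pipeline uses.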

Paper Structure (42 sections, 17 equations, 18 figures)

This paper contains 42 sections, 17 equations, 18 figures.

Figures (18)

  • Figure 1: The overview of DreamFace. Our pipeline comprises three modules: geometry generation (Sec. \ref{sec:geometry}), physically-based texture diffusion (Sec. \ref{sec:appearance}), and animatability empowerment (Sec. \ref{sec:animation}). Given textual guidance, DreamFace is able to generate facial assets that closely resemble the described characteristics in terms of shape and appearance. Our approach is consistent with industry standards in computer graphics production and is able to achieve photo-realistic results when driven and rendered.
  • Figure 2: Geometry generation pipeline. Given the input prompt, we utilize the CLIP model to select the coarse geometry candidates with the highest matching score. Next, we employ a generic LDM to perform SDS on the rendered images under random view and lighting conditions. This allows us to add facial details to the coarse geometry via vertex displacement and a detailed normal map, resulting in a highly detailed geometry.
  • Figure 3: The overview of physically-based texture diffusion. To generate detailed and realistic textures that match the input prompt, DreamFace performs dual-path SDS on textures using both a generic LDM and a texture LDM, in both the latent space and the image space. By jointly optimizing with the two LDMs, we are able to generate high-quality diffuse texture maps that match the input prompt and are consistent with UV unwrapping. An additional texture translation and augmentation module is also included to generate all physically-based textures at high resolution, suitable for rendering.
  • Figure 4:
  • Figure 5: The overview of our Texture LDM training pipeline. Our approach uses two techniques to generate high-quality diffuse maps: (1) Prompt tuning: instead of handcrafted domain-specific text prompts, two domain-specific continuous text prompts $\mathcal{C}_\text{d}$ and $\mathcal{C}_\text{u}$ are combined with the corresponding text prompt and optimized during U-Net denoiser training, which avoids unstable and time-consuming manual prompt engineering. (2) Non-face region masking: the denoising process of the LDM is additionally conditioned on a non-face region mask to ensure that the generated diffuse map is free of undesired elements.
  • ...and 13 more figures
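
The SDS step named in the captions of Figures 2 and 3 is commonly written in the following generic form, introduced with DreamFusion. This is a sketch under assumed notation and may differ from the paper's own equations: $\theta$ stands for the optimized parameters (for example displacements, normal map, or texture), $z$ for the rendered image or its latent, $z_t$ for its noised version at timestep $t$, $y$ for the text prompt, $\hat{\epsilon}_\phi$ for the frozen denoiser, $\epsilon$ for the injected noise, and $w(t)$ for a timestep weighting.

$$\nabla_\theta \mathcal{L}_\text{SDS} = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\big(\hat{\epsilon}_\phi(z_t; y, t) - \epsilon\big)\,\frac{\partial z}{\partial \theta} \right]$$

Read this way, the difference between performing SDS in the latent space and in the image space is whether $z$ denotes the encoded latent or the rendered image itself.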