Text-Driven Diverse Facial Texture Generation via Progressive Latent-Space Refinement

Chi Wang; Junming Huang; Rong Zhang; Qi Wang; Haotian Yang; Haibin Huang; Chongyang Ma; Weiwei Xu

TL;DR

This work tackles text-driven generation of physically based rendering (PBR) facial textures by introducing PBRGAN, a three-stage progressive latent-space refinement framework. It bootstraps from 3DMM-derived UV textures, using a PBR StyleGAN to form a latent space; aligns that space with text through CLIP-guided prompts; and then expands it by fusing the GAN with Score Distillation Sampling (SDS), where an edge-aware SDS (EASDS) conditioned through ControlNet keeps facial structure consistent across views. The approach reduces reliance on ground-truth PBR data, achieves fast inference, and produces high-fidelity, diverse albedo, normal, and roughness maps; in the reported experiments it outperforms state-of-the-art methods in quality and efficiency. The method shows potential for practical use in AR/VR and gaming by enabling text-guided, view-consistent facial textures without extensive data curation or retraining, and it offers a pathway toward geometry-texture co-generation.

Abstract

Automatic 3D facial texture generation has gained significant interest recently. Existing approaches either do not support the traditional physically based rendering (PBR) pipeline or rely on 3D data captured by a Light Stage. Our key contribution is a progressive latent space refinement approach that bootstraps from 3D Morphable Model (3DMM)-based texture maps generated from facial images to produce high-quality and diverse PBR textures, including albedo, normal, and roughness maps. It starts by enhancing Generative Adversarial Networks (GANs) for text-guided and diverse texture generation. To this end, we design a self-supervised paradigm that removes the reliance on ground-truth 3D textures and trains the generative model with only entangled texture maps. In addition, we foster mutual enhancement between GANs and Score Distillation Sampling (SDS): SDS provides GANs with more generative modes, while GANs enable more efficient optimization of SDS. Furthermore, we introduce an edge-aware SDS for multi-view consistent facial structure. Experiments demonstrate that our method outperforms existing 3D texture generation methods in photo-realistic quality, diversity, and efficiency.
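
For readers unfamiliar with SDS, the gradient it is usually written with is sketched below. This is the standard formulation from the text-to-3D literature, given here as background only; it is an assumption that the paper's own equations follow this notation, and the edge-aware variant is not reproduced. The symbols are illustrative: \theta stands for whatever is being optimized (for example generator weights or a latent code), x for an image rendered from the generated textures, y for the text prompt, \hat{\epsilon}_{\phi} for the noise prediction of a pretrained diffusion model, and w(t) for a timestep weighting.

```latex
% Standard SDS gradient (background sketch; notation is assumed, not taken from the paper)
\nabla_{\theta}\mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_{\phi}(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad
x_t = \sqrt{\bar{\alpha}_t}\,x + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
\quad \epsilon \sim \mathcal{N}(0, I)
```

Read this way, the abstract's "mutual enhancement" has a simple interpretation: the diffusion prior supplies gradients toward prompts the GAN has not seen, while a pretrained GAN gives SDS a well-behaved starting point and parameterization.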

Paper Structure (30 sections, 3 equations, 7 figures, 1 table)

Introduction
Related Work
3D-aware image synthesis.
Text-to-3D generation.
PBR texture generation.
Our Method
PBR StyleGAN
Network structure.
Differentiable-rendering-based training.
Regularization terms.
Text and Latent Alignment
Text prompt generation.
The alignment framework.
The alignment loss functions.
Latent Space Refinement
...and 15 more sections

Figures (7)

Figure 1: Our method can faithfully generate a variety of facial textures from text prompts for photo-realistic rendering. From left to right: (a) unconditional generation results, (b) multi-view rendering results using our generated PBR texture from an uncommon prompt "scar", (c) relighting results using our generated PBR texture of "Barack Obama".
Figure 2: The key idea of our latent space refinement approach. The latent space is expanded progressively to handle more text prompts.
Figure 3: The pipeline of PBRGAN. (a) We generate disentangled PBR textures by leveraging entangled FFHQ-UV textures and differentiable rendering. (b) We align the latent space with the text space under the guidance of CLIP to achieve text-guided generation. (c) We amalgamate GAN and SDS and further expand the latent space to handle more text prompts.
Figure 4: Explanation of our edge-aware SDS (EASDS). We sample from different viewpoints and employ soft edges detected from a template texture as feature-line conditions for ControlNet.
Figure 5: Qualitative comparison results. From left to right: the input text prompts, the results generated with ClipFace, DreamFace, Fantasia3D, ours, and ours-EASDS.
...and 2 more figures
