Text-Driven Diverse Facial Texture Generation via Progressive Latent-Space Refinement
Chi Wang, Junming Huang, Rong Zhang, Qi Wang, Haotian Yang, Haibin Huang, Chongyang Ma, Weiwei Xu
TL;DR
This work addresses text-driven generation of physically based rendering (PBR) facial textures by introducing PBRGAN, a three-stage progressive latent-space refinement framework. First, a PBR StyleGAN is bootstrapped from 3DMM-derived UV textures to form a latent space. Second, that latent space is aligned with text via CLIP-based prompts. Third, the space is expanded through a fusion of the GAN with Score Distillation Sampling (SDS), using an edge-aware SDS (EASDS) powered by ControlNet to keep facial structure accurate across views. The approach reduces reliance on ground-truth PBR data, offers fast inference, and produces high-fidelity, diverse albedo, normal, and roughness maps, outperforming state-of-the-art methods in quality and efficiency. The method shows potential for AR/VR and gaming, since it enables text-guided, view-consistent facial textures without extensive data curation or retraining, and it suggests a path toward geometry-texture co-generation.
Abstract
Automatic 3D facial texture generation has gained significant interest recently. Existing approaches either do not support the traditional physically based rendering (PBR) pipeline or rely on 3D data captured by a Light Stage. Our key contribution is a progressive latent space refinement approach that bootstraps from texture maps derived from 3D Morphable Models (3DMMs) fitted to facial images, and generates high-quality and diverse PBR textures, including albedo, normal, and roughness. It starts by enhancing Generative Adversarial Networks (GANs) for text-guided and diverse texture generation. To this end, we design a self-supervised paradigm that removes the reliance on ground-truth 3D textures and trains the generative model with only entangled texture maps. In addition, we foster mutual enhancement between GANs and Score Distillation Sampling (SDS): SDS provides GANs with more generative modes, while GANs make the optimization of SDS more efficient. Furthermore, we introduce an edge-aware SDS for multi-view consistent facial structure. Experiments demonstrate that our method outperforms existing 3D texture generation methods in photo-realistic quality, diversity, and efficiency.
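
For readers unfamiliar with Score Distillation Sampling, the following is a minimal sketch of the generic SDS update, not the paper's implementation. It assumes the commonly used form of the gradient, w(t) * (predicted noise - injected noise), and optimizes a small array directly instead of a texture generator's parameters or latent code. The names `sds_gradient` and `make_toy_denoiser` are hypothetical, and the toy denoiser stands in for a text-conditioned diffusion model by treating a single target image as the whole data distribution.

```python
import numpy as np

def sds_gradient(x, predict_noise, alpha_bar, rng):
    """One-sample SDS gradient with respect to the image x (hypothetical helper)."""
    eps = rng.standard_normal(x.shape)                       # injected noise
    x_t = np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * eps  # forward diffusion
    eps_hat = predict_noise(x_t, alpha_bar)                  # denoiser's noise estimate
    weight = 1.0 - alpha_bar                                 # one common choice of w(t); an assumption
    return weight * (eps_hat - eps)

def make_toy_denoiser(target):
    """Exact noise predictor when the data distribution is a single image `target`.

    This is a stand-in for a pretrained diffusion model, used only so the
    sketch runs without external weights.
    """
    def predict_noise(x_t, alpha_bar):
        return (x_t - np.sqrt(alpha_bar) * target) / np.sqrt(1.0 - alpha_bar)
    return predict_noise

rng = np.random.default_rng(0)
target = np.full((4, 4), 0.5)        # the "mode" the toy prior prefers
x = rng.standard_normal((4, 4))      # stand-in for a rendered texture
predict_noise = make_toy_denoiser(target)

for _ in range(200):
    alpha_bar = rng.uniform(0.1, 0.9)                        # random noise level per step
    x = x - 0.5 * sds_gradient(x, predict_noise, alpha_bar, rng)

x_final = x
```

In this toy setting the update pulls `x` toward `target`. In an actual pipeline the same gradient would be backpropagated through the renderer and generator rather than applied to the pixels directly, and the abstract's claim is that a GAN latent space makes that optimization more efficient.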
