Fast Text-to-3D-Aware Face Generation and Manipulation via Direct Cross-modal Mapping and Geometric Regularization
Jinlu Zhang, Yiyi Zhou, Qiancheng Zheng, Xiaoxiong Du, Gen Luo, Jun Peng, Xiaoshuai Sun, Rongrong Ji
TL;DR
This paper tackles the low efficiency and poor quality of text-to-3D-aware face generation by proposing E^3-FaceNet, an end-to-end framework that directly maps text to a 3D-aware rendering space built on StyleNeRF. It introduces two key components: the Style Code Enhancer, which injects fine-grained textual semantics via cross-modal attention and per-block style-code offsets, and a Geometric Regularization term that stabilizes multi-view geometry through depth- and normal-based constraints. The approach achieves state-of-the-art semantic alignment and multi-view consistency while running orders of magnitude faster than prior T3D methods, including nearly 470× faster five-view generation than Latent3D, and it enables text-guided 3D face manipulation without per-instruction optimization. Experiments on MMCelebA-HQ, CelebA-Text-HQ, and FFHQ-Text show better image quality, 3D consistency, and editing capability than both 2D and 3D baselines. The result is a practical, real-time-capable method for text-driven 3D face synthesis and editing that generalizes well to unseen prompts.
Abstract
Text-to-3D-aware face (T3D Face) generation and manipulation is an emerging research hotspot in machine learning, but existing methods still suffer from low efficiency and poor quality. In this paper, we propose an End-to-End Efficient and Effective network for fast and accurate T3D face generation and manipulation, termed $E^3$-FaceNet. Unlike existing complex generation paradigms, $E^3$-FaceNet learns a direct mapping from text instructions to the 3D-aware visual space. We introduce a novel Style Code Enhancer to strengthen cross-modal semantic alignment, together with a Geometric Regularization objective that maintains consistency across multi-view generations. Extensive experiments on three benchmark datasets demonstrate that $E^3$-FaceNet not only achieves picture-like 3D face generation and manipulation, but also improves inference speed by orders of magnitude. For instance, compared with Latent3D, $E^3$-FaceNet speeds up five-view generation by almost 470 times while still exceeding it in generation quality. Our code is released at https://github.com/Aria-Zhangjl/E3-FaceNet.
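
To make the Style Code Enhancer idea more concrete, the following is a minimal, illustrative sketch of cross-modal attention that produces one offset per generator block and adds it to that block's style code. It is not the released implementation: the function name `enhance_style_codes`, the projection matrices `W_q`, `W_k`, `W_v`, the single-head scaled dot-product form, and all dimensions are assumptions chosen for illustration, and the weights are random rather than learned.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [dot(row, v) for row in W]


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def enhance_style_codes(style_codes, word_feats, W_q, W_k, W_v):
    """Add a text-derived offset to each per-block style code.

    style_codes: list of style vectors, one per generator block (dim d)
    word_feats:  list of word-level text features (dim d_t)
    W_q: (d_a x d), W_k: (d_a x d_t), W_v: (d x d_t)  -- hypothetical projections
    """
    keys = [matvec(W_k, w) for w in word_feats]
    values = [matvec(W_v, w) for w in word_feats]
    scale = math.sqrt(len(keys[0]))
    enhanced = []
    for s in style_codes:
        query = matvec(W_q, s)
        # Attention of this block's style code over the words of the prompt.
        weights = softmax([dot(query, k) / scale for k in keys])
        # Offset is the attention-weighted sum of the projected word features.
        offset = [sum(w * v[i] for w, v in zip(weights, values))
                  for i in range(len(s))]
        enhanced.append([si + oi for si, oi in zip(s, offset)])
    return enhanced


# Toy usage with random (untrained) weights; all sizes are arbitrary.
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


d, d_t, d_a, n_blocks, n_words = 8, 6, 4, 3, 5
codes = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_blocks)]
words = [[rng.gauss(0.0, 1.0) for _ in range(d_t)] for _ in range(n_words)]
new_codes = enhance_style_codes(
    codes, words, rand_matrix(d_a, d), rand_matrix(d_a, d_t), rand_matrix(d, d_t)
)
```

Because each block forms its own query, different blocks can attend to different words of the prompt, which is one plausible reading of "per-block style-code offsets"; the actual architecture may differ in its attention form, normalization, and where the offsets are applied.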
