Fast Text-to-3D-Aware Face Generation and Manipulation via Direct Cross-modal Mapping and Geometric Regularization

Jinlu Zhang, Yiyi Zhou, Qiancheng Zheng, Xiaoxiong Du, Gen Luo, Jun Peng, Xiaoshuai Sun, Rongrong Ji

TL;DR

This paper tackles the inefficiency and quality gaps in text-to-3D-aware face generation by proposing E^3-FaceNet, an end-to-end framework that directly maps text to a 3D-aware rendering space built on StyleNeRF. It introduces two key innovations: the Style Code Enhancer, which injects fine-grained textual semantics via cross-modal attention and per-block style-code offsets, and a Geometric Regularization term that stabilizes multi-view geometry through depth and normal-based constraints. The approach achieves state-of-the-art semantic alignment and multi-view consistency while delivering orders-of-magnitude speedups over prior T3D methods, including up to about 470× faster five-view generation, and enables text-guided 3D face manipulation without per-instruction optimization. Extensive experiments on MMCelebA-HQ, CelebA-Text-HQ, and FFHQ-Text demonstrate superior image quality, 3D consistency, and editing capabilities, with clear advantages over both 2D and 3D baselines. The work delivers a practical, real-time-capable solution for text-driven 3D face synthesis and editing with strong generalization to unseen prompts.

Abstract

Text-to-3D-aware face (T3D Face) generation and manipulation is an emerging research hotspot in machine learning, but it still suffers from low efficiency and poor quality. In this paper, we propose an End-to-End Efficient and Effective network for fast and accurate T3D face generation and manipulation, termed $E^3$-FaceNet. Unlike existing complex generation paradigms, $E^3$-FaceNet uses a direct mapping from text instructions to the 3D-aware visual space. We introduce a novel Style Code Enhancer to enhance cross-modal semantic alignment, alongside an innovative Geometric Regularization objective to maintain consistency across multi-view generations. Extensive experiments on three benchmark datasets demonstrate that $E^3$-FaceNet not only achieves picture-like 3D face generation and manipulation, but also improves inference speed by orders of magnitude. For instance, compared with Latent3D, $E^3$-FaceNet speeds up five-view generation by almost 470 times while still achieving better generation quality. Our code is released at https://github.com/Aria-Zhangjl/E3-FaceNet.

Paper Structure

This paper contains 26 sections, 28 equations, 13 figures, and 4 tables.

Figures (13)

  • Figure 1: (a) The 3D-aware portraits generated by direct cross-modal mapping. Without any 3D regularization or semantic enhancement, the model tends to generate faces with artifacts (marked with red boxes) and fails to generate accurate face attributes (shown in green). (b) Multi-view results of T3D face generation and manipulation by our $E^3$-FaceNet. With the proposed Style Code Enhancer and Geometric Regularization, $E^3$-FaceNet aligns well with the text instructions and generates high-quality 3D-aware face images without obvious defects (red for generation and blue for manipulation). Notably, five-view image generation and manipulation with $E^3$-FaceNet take only 0.64s and 0.89s, respectively. The backgrounds of these images are removed for better comparison.
  • Figure 2: The overall structure of $E^3$-FaceNet. $E^3$-FaceNet uses the extracted text features to modulate the sampled noise, thereby achieving text-guided 3D-aware generation. To improve fine-grained semantic alignment, we introduce a Style Code Enhancer (SCE) to predict style code offsets in each upsample block, which also enables $E^3$-FaceNet to perform fast face editing through the interpolation of offsets.
  • Figure 3: Illustration of the corresponding 3D location $P_{(u,v)}$ and normal vector $n_{(u,v)}$ for a given pixel $p_{(u,v)}$. To generate smooth and natural-looking faces in the absence of multi-view information, we propose to regularize these two terms in 3D space.
  • Figure 4: Comparison with T3D Face generation and manipulation methods. $E^3$-FaceNet generates face images of better quality than the compared methods while aligning well with the generation and editing prompts. Its inference speed is also much faster.
  • Figure 5: The visualizations of different settings of $E^3$-FaceNet. The bottom images are the corresponding feature map
  • ...and 8 more figures
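
The per-pixel quantities named in the Figure 3 caption, the 3D location $P_{(u,v)}$ and the normal $n_{(u,v)}$, can be made concrete with a small NumPy sketch. It assumes a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy` and a per-pixel depth map, neither of which is specified in this summary. The function names and the smoothness penalty are illustrative only and should not be read as the paper's Geometric Regularization objective.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift each pixel (u, v) with depth d to a camera-space point P_(u,v).

    Assumes a pinhole camera; the intrinsics are hypothetical inputs.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column, v: row
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)  # shape (h, w, 3)

def normals_from_points(points):
    """Estimate n_(u,v) as the normalized cross product of forward differences."""
    du = points[:-1, 1:] - points[:-1, :-1]  # step along u (columns)
    dv = points[1:, :-1] - points[:-1, :-1]  # step along v (rows)
    n = np.cross(du, dv)
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.clip(norm, 1e-8, None)  # shape (h-1, w-1, 3)

def normal_smoothness(normals):
    """Generic penalty on differences between neighboring normals (illustrative)."""
    dn_u = normals[:, 1:] - normals[:, :-1]
    dn_v = normals[1:, :] - normals[:-1, :]
    return float((dn_u ** 2).sum(-1).mean() + (dn_v ** 2).sum(-1).mean())

# Usage: a flat surface facing the camera has identical normals everywhere,
# so the smoothness penalty vanishes.
depth = np.full((4, 5), 2.0)
points = backproject(depth, fx=100.0, fy=100.0, cx=2.0, cy=1.5)
normals = normals_from_points(points)
penalty = normal_smoothness(normals)
```

A bumpy or noisy depth map would yield neighboring normals that disagree and therefore a larger penalty, which is the intuition behind regularizing normals when no multi-view supervision is available.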