Controllable 3D Face Generation with Conditional Style Code Diffusion

Xiaolong Shen, Jianxin Ma, Chang Zhou, Zongxin Yang

TL;DR

This work tackles efficient, controllable generation of photorealistic 3D faces conditioned on text and expression. It introduces TEx-Face, a three-component framework: 3D GAN Inversion using a pose-guided inversion method (PoI) and its refined variant (Re-PoI); a Conditional Style Code Diffusion module whose style-code denoiser fuses CLIP text and EMOCA expressions; and a 3D Face Decoding stage based on EG3D. Key contributions include view-consistent inversion through PoI/Re-PoI, multi-condition diffusion that injects text and expression into style codes alongside time-step conditioning, and a data-augmentation strategy to expand paired visual-language data. Together, these enable rapid, multi-condition 3D face synthesis that aligns closely with prompts and expressions. Evaluations on FFHQ, CelebA-HQ, and CelebA-Dialog show efficient inference (≈0.1 s per sample) and improved 3D consistency and text/expression fidelity, highlighting potential for real-time applications in AR/VR and interactive media.

Abstract

Generating photorealistic 3D faces from given conditions is a challenging task. Existing methods often rely on time-consuming one-by-one optimization approaches, which are not efficient for modeling content from the same distribution, e.g., faces. Additionally, an ideal controllable 3D face generation model should consider both facial attributes and expressions. Thus, we propose a novel approach called TEx-Face (TExt & Expression-to-Face) that addresses these challenges by dividing the task into three components, i.e., 3D GAN Inversion, Conditional Style Code Diffusion, and 3D Face Decoding. For 3D GAN inversion, we introduce two methods that aim to enhance the representation of style codes and alleviate 3D inconsistencies. Furthermore, we design a style code denoiser to incorporate multiple conditions into the style code and propose a data augmentation strategy to address the issue of insufficient paired visual-language data. Extensive experiments conducted on FFHQ, CelebA-HQ, and CelebA-Dialog demonstrate the promising performance of our TEx-Face in achieving efficient and controllable generation of photorealistic 3D faces. The code will be available at https://github.com/sxl142/TEx-Face.
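
To make the Conditional Style Code Diffusion component more concrete, the following is a minimal sketch of how a style code could be noised and paired with conditions before being passed to a denoiser. It assumes a standard DDPM-style forward process with a linear beta schedule, which the text above does not confirm. The function names, the toy vectors, and the concatenation-based conditioning in `build_denoiser_input` are hypothetical stand-ins, not the authors' implementation.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bar, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def add_noise(style_code, t, alpha_bar, rng):
    """Sample w_t = sqrt(abar_t) * w_0 + sqrt(1 - abar_t) * eps; return (w_t, eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in style_code]
    a = math.sqrt(alpha_bar[t])
    b = math.sqrt(1.0 - alpha_bar[t])
    noisy = [a * w + b * e for w, e in zip(style_code, eps)]
    return noisy, eps


def build_denoiser_input(noisy_code, t, text_emb, expr_code):
    """Hypothetical conditioning: concatenate the noisy code with a scalar
    time-step feature, the text embedding, and the expression code."""
    return list(noisy_code) + [float(t)] + list(text_emb) + list(expr_code)


rng = random.Random(0)
alpha_bar = linear_alpha_bar(num_steps=1000)
w0 = [0.5, -1.0, 2.0, 0.0]   # toy stand-in for an inverted style code
text_emb = [0.1, 0.2]        # toy stand-in for a CLIP text embedding
expr_code = [0.3]            # toy stand-in for an EMOCA expression code
w_t, eps = add_noise(w0, t=500, alpha_bar=alpha_bar, rng=rng)
x = build_denoiser_input(w_t, 500, text_emb, expr_code)
```

In a training loop under these assumptions, a denoiser would receive `x` and be optimized to predict `eps`. A real system would replace the concatenation with whatever conditioning mechanism the denoiser architecture uses.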

Paper Structure

This paper contains 14 sections, 7 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: An overview of our pipeline. We train an inversion model called Re-PoI and save the style codes it infers. These saved codes are then used to train a style code denoiser with three conditions, i.e., time steps, text prompts, and expression codes. At inference, we decode the generated style codes into 3D faces using EG3D.
  • Figure 2: Our pipeline enables conditional 3D face generation from text, expression codes, or both. The stylization is achieved with StyleGAN-NADA [gal2021stylegannada], a GAN-based style transfer method.
  • Figure 3: Simply extending the 2D inversion method [hu2022style] leads to poor novel-view synthesis, especially when the input image is a side view.
  • Figure 4: An overview of our inversion method. In pretraining, we leverage synthetic multi-view images to learn a mapping that projects the style codes obtained under different views onto one style code, yielding a view-invariant style code and thereby alleviating 3D inconsistency. In finetuning, we freeze the learned mapping (PoM) and the MLP in the PoI and append a refinement branch to improve the quality of the style codes. Note that the PoM of the refinement branch is used for training.
  • Figure 5: A schematic of the Style Code Denoiser.
  • ...and 2 more figures
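
The Figure 4 caption describes mapping style codes obtained under different views onto a single style code. A minimal sketch of one way such a constraint could be scored is shown here. It assumes the shared target is a known reference code for the synthetic multi-view images and uses a plain mean squared error. Both choices, and the function name `view_consistency_loss`, are illustrative assumptions rather than the paper's stated loss.

```python
def view_consistency_loss(view_codes, target_code):
    """Mean squared distance between each per-view code and one shared target.

    view_codes:  list of style codes, one per rendered view of the same face
    target_code: the single style code all views should map onto (assumed known)
    """
    total = 0.0
    for code in view_codes:
        total += sum((a - b) ** 2 for a, b in zip(code, target_code)) / len(code)
    return total / len(view_codes)


# Toy codes predicted from three views of the same synthetic face.
codes_from_views = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]]
canonical = [1.0, 2.0]
loss = view_consistency_loss(codes_from_views, canonical)
```

The loss is zero only when every view produces the same code as the target, which is the sense in which the learned code would be view-invariant under these assumptions.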