High-Fidelity 3D Facial Avatar Synthesis with Controllable Fine-Grained Expressions

Yikang He, Jichao Zhang, Wei Wang, Nicu Sebe, Yao Zhao

Abstract

Facial expression editing methods fall into two broad categories by architecture: 2D-based and 3D-based. The former lack 3D face modeling capabilities, which makes it difficult to edit 3D factors effectively. The latter have demonstrated superior performance in generating high-quality, view-consistent renderings from single-view 2D face images. Although these methods have successfully used animatable models to control facial expressions, they remain limited in achieving precise control over fine-grained expressions. To address this issue, we propose an approach that simultaneously refines the latent code of a pretrained 3D-aware GAN model for texture editing and the expression code of the driven 3DMM model for mesh editing. Specifically, we introduce a Dual Mappers module, comprising a Texture Mapper and an Emotion Mapper, to learn the transformations of the given latent code for textures and of the expression code for meshes, respectively. To optimize the Dual Mappers, we propose a Text-Guided Optimization method that uses a CLIP-based objective function with expression text prompts as targets. It integrates a SubSpace Projection mechanism that projects the text embedding onto the expression subspace, giving more precise control over fine-grained expressions. Extensive experiments and comparative analyses demonstrate the effectiveness and superiority of the proposed method.
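The abstract does not spell out how the SubSpace Projection or the CLIP-based objective is computed. As a minimal sketch of one plausible reading, the snippet below orthogonally projects a text embedding onto the span of a few expression direction vectors and then scores it against an image embedding with a cosine similarity loss. The function names, the choice of an orthogonal projection, and the toy 4-dimensional vectors are all assumptions made for illustration; they are not taken from the paper.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean norm of a vector."""
    return math.sqrt(dot(a, a))


def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        r = list(v)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        n = norm(r)
        if n > eps:  # skip directions already covered by the basis
            basis.append([ri / n for ri in r])
    return basis


def project_to_subspace(e, basis):
    """Orthogonal projection of `e` onto the span of an orthonormal `basis`."""
    out = [0.0] * len(e)
    for b in basis:
        c = dot(e, b)
        out = [o + c * bi for o, bi in zip(out, b)]
    return out


def cosine_loss(a, b):
    """1 - cos(a, b); zero when the two embeddings point the same way."""
    return 1.0 - dot(a, b) / (norm(a) * norm(b))


# Hypothetical expression directions (assumed, e.g. derived from expression prompts).
expression_dirs = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
basis = orthonormalize(expression_dirs)

e_text = [2.0, 1.0, 5.0, 0.0]   # toy text embedding with an off-subspace component
e_image = [2.0, 1.0, 0.0, 0.0]  # toy embedding of the edited image

e_text_proj = project_to_subspace(e_text, basis)
loss_raw = cosine_loss(e_text, e_image)
loss_proj = cosine_loss(e_text_proj, e_image)
```

In this toy setup the projection discards the component of the text embedding that lies outside the expression subspace, so the projected target agrees with the image embedding far better than the raw one. A real implementation would also need to guard against a projection with zero norm before computing the cosine.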

Paper Structure (24 sections, 11 equations, 15 figures, 6 tables)

Figures (15)

  • Figure 1: Our model can generate high-quality, view-consistent face images while enabling fine-grained expression editing. The first row displays the generated images of different facial expression edits (with the reference image located in the lower-right corner), the second row presents different viewpoints, and the third row showcases the generated 3D head meshes.
  • Figure 2: Overview of the proposed architecture, which consists of two main modules: (a) Dual Mappers and (b) Text-Guided Optimization. In (a), the Dual Mappers refine the random latent code $\boldsymbol{w}$ and the expression code $\boldsymbol{\alpha_{R}}$ from a reference image $\boldsymbol{I_{R}}$ using a Cross Attention module and MLP layers, with updates stabilized by skip connections and an $L_2$ loss. The refined codes $\boldsymbol{w^\prime}$ and $\boldsymbol{\alpha_{R}^\prime}$ are passed to a frozen synthesis network to generate the edited image. (The red arrow represents replacing $\boldsymbol{w}$, $\boldsymbol{\alpha_{I}}$ with $\boldsymbol{w^\prime}$, $\boldsymbol{\alpha_{R}^\prime}$.) In (b), Text-Guided Optimization aligns the edited image with a text description (e.g., "A person who is raising brow"). A Projection module maps the text embedding $\boldsymbol{e_{T}}$ to the image embedding space, enhancing compatibility. A Cosine Similarity Loss aligns the embeddings, while an Identity Loss preserves facial identity.
  • Figure 3: The Synthesis Network uses a Tri-Plane representation and Volume Rendering to achieve 3D-aware head avatar generation. The generator $\boldsymbol{G_{uv}}$ produces a Tri-Plane from the latent code $\boldsymbol{w}$, while FLAME accepts $\boldsymbol{\alpha}$, $\boldsymbol{\theta}$, $\boldsymbol{\beta}$ as input and provides a coarse head mesh aligned via rasterization. An MLP predicts color $\boldsymbol{c}$ and density $\boldsymbol{\sigma}$, which are used for Volume Rendering. Finally, a Super-Resolution module enhances the output, creating realistic, editable 3D avatars.
  • Figure 4: Fine-grained expression editing comparison with state-of-the-art animatable 3D image synthesis methods. To ensure a strict and direct comparison, methods supporting reference-guided editing within the same domain (Next3D, Diffusion-rig, and Ours) are evaluated using the exact same identity. For unconditional generative models (DiscoFaceGAN, AniFaceGAN) and models trained on entirely disparate datasets (Morphable Diffusion on FaceScape [yang2020facescape], Gaussian Head Avatar on NeRSemble [kirschstein2023nersemble]), forcing a specific target identity would require error-prone inversion or cross-domain adaptation. To avoid disadvantaging these baselines with reconstruction artifacts, we display high-quality native identities sampled directly from their respective distributions. The zoom-in figure shows a magnified comparison of the red-box region.
  • Figure 5: In contrast to Next3D and Diffusion-rig, our approach excels at capturing fine-grained facial expression details, such as the curvature of the mouth corners during smiling and the position of the eyebrows during frowning, bringing the generated facial features closer to the reference image and the target emotion text descriptions.
  • ...and 10 more figures
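The Figure 3 caption states that an MLP predicts color $\boldsymbol{c}$ and density $\boldsymbol{\sigma}$ for Volume Rendering, without giving the rendering rule. As a hedged illustration, the sketch below uses the standard NeRF-style quadrature along a single ray, where each sample contributes its color weighted by its opacity and by the transmittance accumulated in front of it. The function name `render_ray`, the sample spacing `deltas`, and the toy values are assumptions for this example; the paper's actual renderer may differ.

```python
import math


def render_ray(sigmas, colors, deltas):
    """Composite per-sample densities and RGB colors along one ray.

    sigmas: non-negative densities, ordered front to back
    colors: RGB triples, one per sample
    deltas: distances between consecutive samples
    Returns the rendered pixel color and the per-sample weights.
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        w = transmittance * alpha               # contribution of this sample
        weights.append(w)
        pixel = [p + w * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel, weights


# Toy ray: empty space, then a fully opaque green surface, then an occluded blue sample.
sigmas = [0.0, 1e9, 5.0]
colors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
deltas = [0.1, 0.1, 0.1]

pixel, weights = render_ray(sigmas, colors, deltas)
```

In this toy ray the opaque second sample absorbs all remaining light, so the pixel takes its color and the sample behind it contributes nothing. The weights never sum to more than one, which is what lets the accumulated color be read as a convex blend of the samples plus any background.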