G3FA: Geometry-guided GAN for Face Animation

Alireza Javanmardi; Alain Pagani; Didier Stricker

TL;DR

Real-time talking-head synthesis from a single image often suffers from geometric inconsistencies under pose variation when it relies solely on 2D information. G3FA introduces implicit 3D supervision by integrating depth and normals derived from neural inverse rendering into a GAN-based face animation pipeline, which pairs a 2D motion-estimation front end and a face volume rendering generator with an ensemble of discriminators. The approach combines Unsup3D-based geometry cues, adaptive ray sampling, and volume rendering to produce geometry-consistent, photorealistic outputs, and is validated on VoxCeleb2 and TalkingHead against multiple state-of-the-art methods. It improves geometry fidelity and identity preservation with minimal impact on inference time, and is designed to integrate readily with existing GAN-based reenactment architectures.
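To make the volume rendering step concrete, the sketch below composites samples along a single ray using the standard emission-absorption quadrature common in neural volume rendering. It is an illustration under stated assumptions, not the paper's implementation: the function name `composite_ray`, the uniform step size, and the sample values are all hypothetical, and the summary above does not specify G3FA's exact rendering equations.

```python
import math
from typing import Sequence, Tuple

RGB = Tuple[float, float, float]

def composite_ray(densities: Sequence[float],
                  colors: Sequence[RGB],
                  step: float) -> Tuple[RGB, float]:
    """Alpha-composite samples front to back along one ray.

    Assumes uniformly spaced samples (spacing `step`) and non-negative
    densities. Returns the accumulated color and the accumulated opacity.
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    out = [0.0, 0.0, 0.0]
    for sigma, rgb in zip(densities, colors):
        alpha = 1.0 - math.exp(-sigma * step)  # opacity of this sample
        weight = transmittance * alpha         # contribution to the pixel
        out = [o + weight * c for o, c in zip(out, rgb)]
        transmittance *= 1.0 - alpha
    return (out[0], out[1], out[2]), 1.0 - transmittance

# Hypothetical samples along one ray: empty space, then two dense samples.
color, opacity = composite_ray(
    densities=[0.0, 5.0, 5.0],
    colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
    step=0.5,
)
```

With orthogonal sampling, one such ray would be cast per output pixel, all parallel to the depth axis, so the same routine could be applied independently at every pixel location.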

Abstract

Animating human face images aims to synthesize a desired source identity in a natural-looking way that mimics a driving video's facial movements. In this context, Generative Adversarial Networks have demonstrated remarkable potential in real-time face reenactment using a single source image, yet are constrained by limited geometry consistency compared to graphics-based approaches. In this paper, we introduce Geometry-guided GAN for Face Animation (G3FA) to tackle this limitation. Our novel approach enables the face animation model to incorporate 3D information using only 2D images, improving the image generation capabilities of the talking head synthesis model. We integrate inverse rendering techniques to extract 3D facial geometry properties, improving the feedback loop to the generator through a weighted average ensemble of discriminators. In our face reenactment model, we leverage 2D motion warping to capture motion dynamics, along with orthogonal ray sampling and volume rendering techniques to produce the final visual output. We evaluate G3FA through comprehensive experiments under several evaluation protocols on the VoxCeleb2 and TalkingHead benchmarks, demonstrating the effectiveness of the proposed framework compared to state-of-the-art real-time face animation methods.
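The "weighted average ensemble of discriminators" can be pictured with a small sketch. The code below is only a sketch under assumptions: the split into one discriminator each for the RGB frame, the estimated depth, and the estimated normals, as well as the scores, the weights, and the name `ensemble_score`, are hypothetical and are not taken from the paper.

```python
from typing import Sequence

def ensemble_score(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of per-discriminator scores.

    The weights are normalized by their sum, so they need not add up to 1.
    """
    if len(scores) != len(weights) or not scores:
        raise ValueError("scores and weights must be non-empty and equal in length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(w * s for w, s in zip(weights, scores)) / total

# Hypothetical realism scores in [0, 1] from three discriminators that judge
# the RGB frame, its estimated depth map, and its estimated normal map.
scores = {"rgb": 0.80, "depth": 0.60, "normal": 0.40}
weights = {"rgb": 2.0, "depth": 1.0, "normal": 1.0}  # illustrative values only

combined = ensemble_score(
    [scores[k] for k in scores],
    [weights[k] for k in scores],
)
```

In a training loop, a combined signal of this kind would feed the generator's adversarial loss, so the generator is penalized for implausible geometry as well as for unrealistic appearance.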

Paper Structure (18 sections, 12 equations, 4 figures, 2 tables)

This paper contains 18 sections, 12 equations, 4 figures, 2 tables.

Introduction
Related Works
Face Reenactment
Generative Models
Method
2D Motion Estimation
Rendering
Neural Inverse Rendering
Ensemble of Discriminators
Experiment
Implementation Details
Training details
Evaluation Metrics
Comparison with State-of-the-Art
Same-identity Reconstruction
...and 3 more sections

Figures (4)

Figure 1: Implicit 3D supervision: an inverse rendering module can guide the generator toward more geometry-consistent output. Canonical shading is used here only to better visualize the differences between the two cases; it is not part of the model's pipeline.
Figure 2: Face animation pipeline: facial expression and pose are captured from keypoints, followed by implicit 3D supervision using inverse rendering and an ensemble of discriminators. NIR stands for Neural Inverse Rendering, which is a pre-trained model, and FVR is our Face Volume Rendering module.
Figure 3: Same-identity reconstruction: our method exhibits superior performance in both photorealistic image generation and precise synthesis of fine details on VoxCeleb2 (Chung et al., 2018).
Figure 4: Cross-identity reenactment: a qualitative comparison on the TalkingHead (TK) dataset (Wang et al., 2021), demonstrating our method's superiority in geometry reconstruction and photorealistic face generation.
