One-shot Neural Face Reenactment via Finding Directions in GAN's Latent Space

Stella Bounareli, Christos Tzelepis, Vasileios Argyriou, Ioannis Patras, Georgios Tzimiropoulos

TL;DR

This work addresses one-shot neural face reenactment: transferring the head pose and expression of a target face to a source face without training dedicated embedding networks to disentangle identity from pose/expression. It proposes a latent-space direction framework in which a linear mapping $\mathbf{A}$ links pose/expression changes $\Delta\mathbf{p}$, obtained from a 3D Morphable Model, to latent-space shifts $\Delta\mathbf{w}$ in a fine-tuned StyleGAN2, enabling controllable reenactment through $\mathbf{w}_r = \mathbf{w}_s + \mathbf{A}\Delta\mathbf{p}$. The method extends to real images via GAN inversion and mixed real/synthetic training. It also introduces joint training of the real-image encoder $\mathcal{E}_w$ and $\mathbf{A}$ to achieve optimization-free inference, along with a feature-space refinement that improves background and hair details. Experiments on VoxCeleb1/2 show higher quality than state-of-the-art methods for both self and cross-subject reenactment, and ablations validate the contribution of the individual losses and training variants. The approach offers a practical, efficient path to high-fidelity face reenactment; the authors acknowledge limitations tied to data diversity and facial accessories.
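The latent update stated in the TL;DR can be illustrated with a small sketch. This is not the authors' implementation: it uses plain Python lists instead of tensors, toy dimensions, and a random matrix in place of a trained $\mathbf{A}$. It also assumes that the pose/expression change is the difference between the target and source parameter vectors, which is a natural reading of the summary but is not spelled out there. The names `latent_shift` and `reenact_code` are hypothetical.

```python
import random


def latent_shift(A, delta_p):
    """Return delta_w = A @ delta_p, with A given as a list of rows."""
    return [sum(a * dp for a, dp in zip(row, delta_p)) for row in A]


def reenact_code(w_s, A, p_s, p_t):
    """Compute w_r = w_s + A * delta_p.

    Assumption: delta_p = p_t - p_s (target minus source parameters).
    """
    delta_p = [pt - ps for pt, ps in zip(p_t, p_s)]
    delta_w = latent_shift(A, delta_p)
    return [w + dw for w, dw in zip(w_s, delta_w)]


# Toy sizes chosen only for illustration (not the paper's dimensions).
LATENT_DIM, PARAM_DIM = 6, 3
rng = random.Random(0)

# Random stand-in for the learned directions matrix A.
A = [[rng.gauss(0.0, 0.1) for _ in range(PARAM_DIM)] for _ in range(LATENT_DIM)]
w_s = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]  # source latent code
p_s = [0.0, 0.1, -0.2]  # source pose/expression parameters (made up)
p_t = [0.3, 0.1, 0.4]   # target pose/expression parameters (made up)

w_r = reenact_code(w_s, A, p_s, p_t)     # latent code of the reenacted image
w_same = reenact_code(w_s, A, p_s, p_s)  # same pose/expression: no shift
```

Because the mapping is linear, a zero parameter change leaves the source code untouched, so `w_same` equals `w_s`. In the actual method the resulting code would be passed to the generator to synthesize the reenacted image.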

Abstract

In this paper, we present our framework for neural face/head reenactment whose goal is to transfer the 3D head orientation and expression of a target face to a source face. Previous methods focus on learning embedding networks for identity and head pose/expression disentanglement which proves to be a rather hard task, degrading the quality of the generated images. We take a different approach, bypassing the training of such networks, by using (fine-tuned) pre-trained GANs which have been shown capable of producing high-quality facial images. Because GANs are characterized by weak controllability, the core of our approach is a method to discover which directions in latent GAN space are responsible for controlling head pose and expression variations. We present a simple pipeline to learn such directions with the aid of a 3D shape model which, by construction, inherently captures disentangled directions for head pose, identity, and expression. Moreover, we show that by embedding real images in the GAN latent space, our method can be successfully used for the reenactment of real-world faces. Our method features several favorable properties including using a single source image (one-shot) and enabling cross-person reenactment. Extensive qualitative and quantitative results show that our approach typically produces reenacted faces of notably higher quality than those produced by state-of-the-art methods for the standard benchmarks of VoxCeleb1 & 2.

Paper Structure

This paper contains 41 sections, 17 equations, 30 figures, and 14 tables.

Figures (30)

  • Figure 1: Overview of the proposed framework: Given a pair of source $\mathbf{I}_s$ and target $\mathbf{I}_t$ images, we calculate the head pose/expression parameter vectors $\mathbf{p}_s$ and $\mathbf{p}_t$, respectively, using the $\mathrm{Net_{3D}}$ network. The matrix of directions $\mathbf{A}$ is trained so that, given the shift $\Delta \mathbf{w} = \mathbf{A}\Delta \mathbf{p}$, the reenacted image $\mathbf{I}_r$ generated from the latent code $\mathbf{w}_r = \mathbf{w}_s + \Delta \mathbf{w}$ transfers the head pose and expression of the target face while maintaining the identity of the source face.
  • Figure 2: Examples of face reenactment without ("w/o opt.") and with ("w/ opt.") the generator's optimization. We additionally show results using our proposed joint training scheme ("Joint Training") and the refinement of StyleGAN2's feature space ("FSR"), described in the paper's sections on joint training and feature-space refinement, respectively.
  • Figure 3: To eliminate the need for the optimization step during inference, we propose to jointly train the real image inversion encoder $\mathcal{E}_w$ and the directions matrix $\mathbf{A}$. We note that during training both the generator $\mathcal{G}$ and the $\mathrm{Net_{3D}}$ network are frozen.
  • Figure 4: Cycle loss: Given a pair of source ($\mathbf{I}_s^1$) and target ($\mathbf{I}_t^1$) images, we calculate the corresponding reenacted image $\mathbf{I}_r^1$. We then use this image as the source and the source image of the first pair as the target to calculate the second reenacted image $\mathbf{I}_r^2$, which is constrained to be similar to $\mathbf{I}_s^1$.
  • Figure 5: Training of feature space encoder $\mathcal{E}_{\mathcal{F}}$ in the real image inversion task. $\mathcal{E}_{\mathcal{F}}$ takes as input a real image and predicts the shift $\Delta f_4$ that updates the feature map $f_4$ of the $4^{th}$ feature layer of StyleGAN2's generator.
  • ...and 25 more figures
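
The cycle loss described in the Figure 4 caption can be sketched as follows. This is a toy illustration rather than the paper's loss: an "image" is a flat list whose first entries stand for identity and whose remaining entries stand for pose/expression, the reenactment step is passed in as a plain function, and the distance is a mean absolute difference chosen only for simplicity. All names (`cycle_loss`, `ideal_reenact`, `leaky_reenact`, `ID_DIM`) are hypothetical.

```python
def l1_distance(x, y):
    """Mean absolute difference between two equally sized flat 'images'."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def cycle_loss(reenact, img_s, img_t, distance=l1_distance):
    """Reenact source -> target, then reenact the result back to the source."""
    img_r1 = reenact(img_s, img_t)   # I_r^1: source identity, target pose
    img_r2 = reenact(img_r1, img_s)  # I_r^2: should reproduce I_s^1
    return distance(img_r2, img_s)


# Toy layout (assumption): [identity entries..., pose/expression entries...].
ID_DIM = 2


def ideal_reenact(src, tgt):
    """Keep the source identity part and copy the target pose part."""
    return src[:ID_DIM] + tgt[ID_DIM:]


def leaky_reenact(src, tgt, leak=0.25):
    """Like ideal_reenact, but part of the target identity leaks in."""
    identity = [(1 - leak) * s + leak * t
                for s, t in zip(src[:ID_DIM], tgt[:ID_DIM])]
    return identity + tgt[ID_DIM:]


img_s = [1.0, 0.0, 0.2, 0.5]  # made-up source "image"
img_t = [0.0, 1.0, 0.8, 0.1]  # made-up target "image"

loss_ideal = cycle_loss(ideal_reenact, img_s, img_t)
loss_leaky = cycle_loss(leaky_reenact, img_s, img_t)
```

With a reenactment step that perfectly separates identity from pose, the round trip returns the source and the loss is zero. Any identity leakage leaves a residual, which is the signal such a loss would penalize during training.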