Pose and Facial Expression Transfer by using StyleGAN

Petr Jahoda, Jan Cech

TL;DR

This work tackles pose and expression transfer between face images. Two inputs, a driving image and an identity image, are projected into the latent space of StyleGAN2 by a motion encoder $E_m$, an identity encoder $E_i$, and a mapping network $M$; the StyleGAN2 generator $G$ is kept fixed. The model is trained in a self-supervised manner on unlabeled VoxCeleb2 video data. The motion embedding of the driving image and the identity embedding of the target image are combined into a latent code $z \in \mathcal{W}^+$, which is rendered as $G(z)$. The approach outperforms a baseline latent-space editing method and ablations, offering strong identity preservation (ArcFace-like similarity ~0.80) and credible pose/expression transfer across diverse subjects. Its limitations are hair/background artifacts and a preliminary ReStyle inversion step required for target identities. The method also enables generation of random identities with controllable facial motion, runs in close to real time on standard GPUs, and has practical implications for video synthesis and face-editing applications.

Abstract

We propose a method to transfer pose and expression between face images. Given a source and target face portrait, the model produces an output image in which the pose and expression of the source face image are transferred onto the target identity. The architecture consists of two encoders and a mapping network that projects the two inputs into the latent space of StyleGAN2, which finally generates the output. The training is self-supervised from video sequences of many individuals. Manual labeling is not required. Our model enables the synthesis of random identities with controllable pose and expression. Close-to-real-time performance is achieved.
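
The data flow described above (two encoders, a mapping network, and a fixed generator) can be illustrated with a minimal sketch. This is not the authors' implementation: every network is replaced by a random linear map, the embedding width and toy image size are hypothetical, the $18 \times 512$ shape of $\mathcal{W}^+$ assumes a 1024×1024 StyleGAN2, and joining the two embeddings by concatenation is an assumption made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_LAYERS, LATENT_DIM = 18, 512  # assumed W+ shape (StyleGAN2 at 1024x1024)
IMG_SHAPE = (3, 16, 16)           # toy image size, for illustration only
EMB_DIM = 128                     # hypothetical embedding width


def make_linear(in_dim, out_dim):
    """Return a random linear map standing in for a neural network."""
    weight = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)
    return lambda x: x.reshape(-1) @ weight


n_pixels = int(np.prod(IMG_SHAPE))
motion_encoder = make_linear(n_pixels, EMB_DIM)    # stand-in for E_m (trained)
identity_encoder = make_linear(n_pixels, EMB_DIM)  # stand-in for E_i (fixed)
mapping_network = make_linear(2 * EMB_DIM, NUM_LAYERS * LATENT_DIM)  # M (trained)
generator = make_linear(NUM_LAYERS * LATENT_DIM, n_pixels)  # stand-in for G (fixed)


def transfer(source_img, target_img):
    """Combine the source's motion with the target's identity, then render."""
    motion = motion_encoder(source_img)
    identity = identity_encoder(target_img)
    # Assumption: the two embeddings are joined by concatenation before M.
    joined = np.concatenate([motion, identity])
    z = mapping_network(joined).reshape(NUM_LAYERS, LATENT_DIM)  # latent code in W+
    output_img = generator(z).reshape(IMG_SHAPE)
    return z, output_img


source = rng.standard_normal(IMG_SHAPE)  # provides pose and expression
target = rng.standard_normal(IMG_SHAPE)  # provides identity
z, output = transfer(source, target)
print(z.shape, output.shape)
```

The sketch only demonstrates shapes and the order of operations: the source image influences the output solely through the motion branch, and the target image solely through the identity branch.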

Paper Structure

This paper contains 16 sections, 3 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Results of our method. Pose and expression from the source image are transferred onto the identity of the target image. The method generalizes to paintings, despite being trained on videos of real people.
  • Figure 2: The architecture of the proposed model. The Motion encoder and Mapping network weights are trained, while the Identity encoder and StyleGAN2 weights stay fixed during training.
  • Figure 3: Pose and expression transfer results. The top row depicts the target (identity) input images, leftmost column the source (driving) input images. The grid shows the transfer results. The identities are preserved column-wise, and the poses and expressions are preserved row-wise.
  • Figure 4: Pose and expression transfer comparison. The top two rows show the inputs: source and target images. The following rows show the results of the baseline methods (pSp and e4e inversion) and of three variants of our method: Ours-Gen with optimized generator weights, Ours-Cos with CosFace loss, and Ours, our best model.