Pose and Facial Expression Transfer by using StyleGAN
Petr Jahoda, Jan Cech
TL;DR
This work transfers pose and expression between face images. A driving image, which supplies the pose and expression, and an identity (target) image are projected into the latent space of StyleGAN2 by a motion encoder $E_m$, an identity encoder $E_i$, and a mapping network $M$; the StyleGAN2 generator $G$ is kept fixed. The model is trained in a self-supervised manner on unlabeled VoxCeleb2 video data. The motion embedding of the driving image and the identity embedding of the target image are combined into a latent code $z \in \mathcal{W}^+$, and the output is rendered as $G(z)$. The approach outperforms a baseline latent-space editing method and ablated variants, with strong identity preservation (ArcFace-like similarity of about 0.80) and credible pose and expression transfer across diverse subjects. Its limitations are hair and background artifacts and the need for a preliminary ReStyle inversion step for target identities. The method also enables efficient generation of random identities with controllable pose and expression, runs in close to real time on standard GPUs, and has practical implications for video synthesis and face-editing applications.
Abstract
We propose a method to transfer pose and expression between face images. Given a source and target face portrait, the model produces an output image in which the pose and expression of the source face image are transferred onto the target identity. The architecture consists of two encoders and a mapping network that projects the two inputs into the latent space of StyleGAN2, which finally generates the output. The training is self-supervised from video sequences of many individuals. Manual labeling is not required. Our model enables the synthesis of random identities with controllable pose and expression. Close-to-real-time performance is achieved.
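The data flow described above (two encoders, a mapping network, and a fixed generator) can be sketched at the level of tensor shapes. This is a minimal illustration, not the authors' implementation: every network is replaced by a random linear map, and the toy image size, the embedding width, the 18 x 512 layout of $\mathcal{W}^+$, the use of concatenation to combine the two embeddings, and feeding the identity encoder an inverted latent code (such as one ReStyle might produce) rather than the image itself are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes, for illustration only (not taken from the paper).
IMG_SHAPE = (3, 16, 16)           # toy image: channels, height, width
IMG_DIM = int(np.prod(IMG_SHAPE))
EMB_DIM = 64                      # width of the motion and identity embeddings
N_LAYERS, W_DIM = 18, 512         # assumed W+ layout (one 512-d code per layer)


def linear(in_dim, out_dim):
    """Return a random linear map standing in for a trained network."""
    weight = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)
    return lambda x: x @ weight


# Hypothetical placeholders for the four components named in the text.
E_m = linear(IMG_DIM, EMB_DIM)              # motion encoder
E_i = linear(N_LAYERS * W_DIM, EMB_DIM)     # identity encoder
M = linear(2 * EMB_DIM, N_LAYERS * W_DIM)   # mapping network
G = linear(N_LAYERS * W_DIM, IMG_DIM)       # generator, never updated


def transfer(driving_image, identity_code):
    """Combine motion from the driving image with a target identity code."""
    motion = E_m(driving_image.ravel())
    identity = E_i(identity_code.ravel())
    # Concatenation is an assumed way of combining the two embeddings.
    z = M(np.concatenate([motion, identity])).reshape(N_LAYERS, W_DIM)
    image = G(z.ravel()).reshape(IMG_SHAPE)
    return z, image


driving = rng.standard_normal(IMG_SHAPE)
identity_code = rng.standard_normal((N_LAYERS, W_DIM))
z, image = transfer(driving, identity_code)
print(z.shape, image.shape)
```

In this sketch, keeping `identity_code` fixed while feeding successive video frames as `driving` mimics animating one identity. Since `G` has no trainable role here, only `E_m`, `E_i`, and `M` would be updated in a training loop.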
