FSRT: Facial Scene Representation Transformer for Face Reenactment from Factorized Appearance, Head-pose, and Facial Expression Features

Andre Rochow, Max Schwarz, Sven Behnke

TL;DR

FSRT addresses cross-reenactment in facial animation by learning a set-latent representation of the source face that factorizes appearance, head pose, and facial expression. A transformer encoder processes patch embeddings from one or more source images, and a per-pixel transformer decoder renders colors conditioned on the driving keypoints $k_D$ and the latent driving expression vector $e_D$, enabling flexible, multi-source reenactment without explicit motion modeling. The authors introduce targeted augmentation and a VICReg-inspired statistical regularization to encourage disentanglement and generalization, along with adversarial and perceptual losses to boost realism. On VoxCeleb, a randomized user study indicates state-of-the-art motion transfer quality and temporal consistency. FSRT also supports relative motion transfer and has the potential for real-time inference, with throughput that scales across multiple GPUs.
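To make the "VICReg-inspired statistical regularization" more concrete, the following is a minimal NumPy sketch of the generic VICReg variance and covariance terms applied to a batch of latent vectors. It is not the paper's loss: the function names, the `gamma` and `eps` values, and the batch shapes are assumptions chosen for illustration, and the summary does not state how FSRT adapts or weights these terms.

```python
import numpy as np

def variance_term(z, gamma=1.0, eps=1e-4):
    """Hinge loss that is positive when a dimension's batch std falls below gamma."""
    std = np.sqrt(z.var(axis=0) + eps)
    return np.mean(np.maximum(0.0, gamma - std))

def covariance_term(z):
    """Sum of squared off-diagonal covariances, divided by the feature dimension."""
    n, d = z.shape
    zc = z - z.mean(axis=0)
    cov = zc.T @ zc / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return np.sum(off_diag ** 2) / d

rng = np.random.default_rng(0)
# Hypothetical batch of 256 expression vectors with 8 dimensions each.
expr = rng.normal(size=(256, 8))
# Degenerate case: every dimension carries the same signal.
redundant = np.repeat(expr[:, :1], 8, axis=1)

print("variance term (healthy batch):  ", variance_term(expr))
print("variance term (all zeros):      ", variance_term(np.zeros_like(expr)))
print("covariance term (healthy batch):", covariance_term(expr))
print("covariance term (redundant):    ", covariance_term(redundant))
```

The variance term penalizes dimensions that collapse to a constant, and the covariance term penalizes dimensions that duplicate each other, which is the kind of statistical pressure the TL;DR refers to when it mentions encouraging disentanglement.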

Abstract

The task of face reenactment is to transfer the head motion and facial expressions from a driving video to the appearance of a source image, which may be of a different person (cross-reenactment). Most existing methods are CNN-based and estimate optical flow from the source image to the current driving frame, which is then inpainted and refined to produce the output animation. We propose a transformer-based encoder for computing a set-latent representation of the source image(s). We then predict the output color of a query pixel using a transformer-based decoder, which is conditioned with keypoints and a facial expression vector extracted from the driving frame. Latent representations of the source person are learned in a self-supervised manner that factorize their appearance, head pose, and facial expressions. Thus, they are perfectly suited for cross-reenactment. In contrast to most related work, our method naturally extends to multiple source images and can thus adapt to person-specific facial dynamics. We also propose data augmentation and regularization schemes that are necessary to prevent overfitting and support generalizability of the learned representations. We evaluated our approach in a randomized user study. The results indicate superior performance compared to the state-of-the-art in terms of motion transfer quality and temporal consistency.
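The encoder-decoder idea in the abstract can be sketched in a few lines: each query pixel is described by its coordinates plus the driving keypoints and expression vector, and a cross-attention step reads from the set-latent source representation to predict a color. The NumPy code below is a single-head, single-layer illustration with random weights. All names (`build_queries`, `decode_pixels`), dimensions, and the sigmoid RGB head are assumptions made for this sketch; the actual decoder architecture is described in the paper and its supplementary material.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def build_queries(h, w, k_d, e_d):
    """Concatenate each pixel's normalized (x, y) with the driving keypoints and expression."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1)        # (h*w, 2)
    cond = np.concatenate([k_d.ravel(), e_d])                  # (2K + E,)
    return np.concatenate([coords, np.tile(cond, (h * w, 1))], axis=1)

def decode_pixels(queries, latents, w_q, w_k, w_v, w_rgb):
    """Single-head cross-attention from pixel queries to the latent set, then an RGB head."""
    q = queries @ w_q                                          # (P, d)
    k = latents @ w_k                                          # (N, d)
    v = latents @ w_v                                          # (N, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))             # (P, N)
    feats = attn @ v                                           # (P, d)
    rgb = 1.0 / (1.0 + np.exp(-(feats @ w_rgb)))               # (P, 3), values in (0, 1)
    return rgb, attn

rng = np.random.default_rng(0)
K, E, C, D = 10, 16, 24, 16            # hypothetical sizes, not taken from the paper
k_d = rng.uniform(-1, 1, size=(K, 2))  # driving keypoints
e_d = rng.normal(size=E)               # driving expression vector
latents = rng.normal(size=(32, C))     # set-latent tokens from one source image

queries = build_queries(4, 4, k_d, e_d)
w_q = rng.normal(size=(queries.shape[1], D)) * 0.1
w_k = rng.normal(size=(C, D)) * 0.1
w_v = rng.normal(size=(C, D)) * 0.1
w_rgb = rng.normal(size=(D, 3)) * 0.1

rgb, attn = decode_pixels(queries, latents, w_q, w_k, w_v, w_rgb)

# A second source image only adds more tokens to the set; the decoder is unchanged.
more_latents = np.concatenate([latents, rng.normal(size=(32, C))], axis=0)
rgb_multi, attn_multi = decode_pixels(queries, more_latents, w_q, w_k, w_v, w_rgb)
```

Because the decoder attends over a set of tokens rather than a fixed-size feature map, appending tokens from additional source images requires no architectural change, which is one way to read the abstract's claim that the method "naturally extends to multiple source images".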

Paper Structure (51 sections, 17 equations, 16 figures, 4 tables)

Figures (16)

  • Figure 1: Overview of our method (relative motion transfer). The source image(s) are encoded along with keypoints $k_S$, capturing head pose, and facial expression vectors $e_S$ to a set-latent representation of the source person. The decoder attends to this representation for a query pixel, conditioned on keypoints $k_D$ and a facial expression vector $e_D$ extracted from the driving frame. $\oplus$ denotes pixel-wise concatenation. Images from the VoxCeleb test set.
  • Figure 2: Architecture details. Given the driving frame and source images, we extract facial keypoints and latent expression vectors. The extracted source information is used to generate the input representation of the Patch CNN. The encoder infers the set-latent source face representation from the patch embeddings as in SRT. The decoder is applied to each query pixel individually and is conditioned on the driving keypoints and the latent driving expression vector. For further implementation details we refer to the Supplementary Material.
  • Figure 3: Regularization benefit in Phase I training (relative motion transfer). If trained without statistical regularization (w/o Stat. Reg.), artifacts originating from the driving frame are visible in the background around the face boundary. When dropping regularization entirely (w/o Reg.), color distortions, background artifacts, and shape deformations are clearly visible. The lower sequence uses a source image from the CelebA-HQ dataset.
  • Figure 4: Cross-reenactment comparison with absolute motion transfer on the VoxCeleb test set. We generate more accurate expressions with fewer shape deformations (higher ID preservation).
  • Figure 5: Comparison with SOTA in cross-reenactment with relative motion transfer. Our method is more robust to the alignment assumption for relative motion transfer, generates more accurate expressions, and handles larger pose offsets. All images are from the VoxCeleb test set, except the lower block, which shows generalization to source images from the CelebA-HQ dataset.
  • ...and 11 more figures
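
Several captions above distinguish absolute from relative motion transfer. As a hedged illustration, the sketch below shows the convention commonly used in keypoint-based reenactment: the source keypoints are displaced by the motion the driving keypoints have undergone since the first driving frame. That this is how FSRT forms its decoder conditioning is an assumption; the captions do not spell out the formula, and the function name and toy keypoints are hypothetical.

```python
import numpy as np

def relative_keypoints(k_source, k_driving_first, k_driving_t):
    """Shift source keypoints by the driving motion since the first driving frame."""
    return k_source + (k_driving_t - k_driving_first)

# Toy 2D keypoints in normalized image coordinates (hypothetical values).
k_s = np.array([[0.10, 0.20], [-0.30, 0.05]])    # source image
k_d0 = np.array([[0.00, 0.25], [-0.35, 0.00]])   # first driving frame
k_dt = k_d0 + np.array([0.05, -0.02])            # driving head shifted slightly

# Absolute transfer would condition on k_dt directly; relative transfer keeps
# the source geometry and applies only the driving displacement.
k_query = relative_keypoints(k_s, k_d0, k_dt)
print(k_query)
```

Under this convention the first driving frame is assumed to be roughly aligned with the source pose, which is the alignment assumption mentioned in the Figure 5 caption.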