DiffusionAct: Controllable Diffusion Autoencoder for One-shot Face Reenactment

Stella Bounareli, Christos Tzelepis, Vasileios Argyriou, Ioannis Patras, Georgios Tzimiropoulos

TL;DR

DiffusionAct tackles one-shot face reenactment by pairing a pre-trained diffusion autoencoder with a learnable reenactment encoder conditioned on target facial landmarks and gaze. The reenactment encoder predicts a semantic code $\mathbf{z}_r$ that guides a DDIM sampler to render a reenacted image $\mathbf{x}_0^r$, which preserves the source identity while adopting the target pose and expressions. The model is trained in two stages: self reenactment pre-training that aligns $\mathbf{z}_r$ with $\mathbf{z}_t$, followed by a main stage with reconstruction and pose losses. The loss suite includes $\mathcal{L}_{pix}$, $\mathcal{L}_{per}$, $\mathcal{L}_{id}$, $\mathcal{L}_{bg}$, $\mathcal{L}_{st}$, and the pose terms $\mathcal{L}_g$, $\mathcal{L}_{sh}$, $\mathcal{L}_{hp}$. Evaluations on VoxCeleb1/2 show high image fidelity and robust pose transfer, with results better than or on par with GAN-, StyleGAN2-, and diffusion-based baselines, while maintaining source appearance details such as hair and glasses. The approach uses 3DMM-based landmarks obtained via EMOCA to avoid identity leakage and remains competitive under cross-subject conditions; its noted limitation is slower inference owing to diffusion sampling. Overall, DiffusionAct provides a practical, one-shot, controllable path to realistic face reenactment using pre-trained diffusion models without subject-specific fine-tuning.
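To make the role of the semantic code in sampling concrete, the following is a minimal sketch of deterministic DDIM decoding conditioned on a code `z`. It is not the authors' implementation: it uses the standard DDIM update with no added noise, and the names `ddim_step`, `ddim_decode`, `oracle_eps`, the `alpha_bars` schedule, and the toy two-dimensional "image" are all assumptions made for illustration. In the real model, the noise predictor would be the DiffAE decoder network conditioned on $\mathbf{z}_r$.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical signature: (x_t, alpha_bar_t, z) -> predicted noise.
NoiseModel = Callable[[Vector, float, Vector], Vector]


def ddim_step(x_t: Vector, eps: Vector, a_t: float, a_prev: float) -> Vector:
    """One deterministic DDIM update (no added noise) from level a_t to a_prev."""
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = [(x - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
               for x, e in zip(x_t, eps)]
    # Move that estimate to the next, less noisy level.
    return [math.sqrt(a_prev) * x0 + math.sqrt(1.0 - a_prev) * e
            for x0, e in zip(x0_pred, eps)]


def ddim_decode(x_T: Vector, z: Vector, eps_model: NoiseModel,
                alpha_bars: Sequence[float]) -> Vector:
    """Run the sampler from the noisiest level down to a clean sample.

    alpha_bars lists cumulative signal levels in (0, 1), noisiest first;
    a final level of 1.0 (no noise) is appended internally.
    """
    x = list(x_T)
    levels = list(alpha_bars) + [1.0]
    for a_t, a_prev in zip(levels[:-1], levels[1:]):
        eps = eps_model(x, a_t, z)  # the code z conditions every step
        x = ddim_step(x, eps, a_t, a_prev)
    return x


def oracle_eps(x_t: Vector, a_t: float, z: Vector) -> Vector:
    """Toy stand-in for a trained network: pretends the clean image equals z."""
    return [(x - math.sqrt(a_t) * zi) / math.sqrt(1.0 - a_t)
            for x, zi in zip(x_t, z)]


# Toy usage: a 2-value "image" and a 2-value semantic code (both hypothetical).
z_r = [1.0, -2.0]
x_T = [0.3, -1.2]
x_0_r = ddim_decode(x_T, z_r, oracle_eps, alpha_bars=[0.1, 0.5, 0.9])
```

With the toy oracle, the decoded sample matches `z_r` up to floating-point rounding, because the final step moves to a noise-free level. The point of the sketch is only that the same starting noise `x_T` can be steered to different outputs by changing the code passed to the noise predictor, which is the mechanism the reenactment encoder relies on.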

Abstract

Video-driven neural face reenactment aims to synthesize realistic facial images that successfully preserve the identity and appearance of a source face, while transferring the target head pose and facial expressions. Existing GAN-based methods suffer from either distortions and visual artifacts or poor reconstruction quality, i.e., the background and several important appearance details, such as hair style/color, glasses and accessories, are not faithfully reconstructed. Recent advances in Diffusion Probabilistic Models (DPMs) enable the generation of high-quality realistic images. To this end, in this paper we present DiffusionAct, a novel method that leverages the photo-realistic image generation of diffusion models to perform neural face reenactment. Specifically, we propose to control the semantic space of a Diffusion Autoencoder (DiffAE), in order to edit the facial pose of the input images, defined as the head pose orientation and the facial expressions. Our method allows one-shot, self, and cross-subject reenactment, without requiring subject-specific fine-tuning. We compare against state-of-the-art GAN-, StyleGAN2-, and diffusion-based methods, showing better or on-par reenactment performance.

Paper Structure

This paper contains 20 sections, 17 equations, 19 figures, and 8 tables.

Figures (19)

  • Figure 1: Our DPM-based method, DiffusionAct, performs one-shot self (top row) and cross-subject (bottom row) neural face reenactment. We demonstrate that, compared to current state-of-the-art methods, namely DaGAN [hong2022depth], Face2Face [yang2022face2face], DPE [pang2023dpe] and HyperReenact [bounareli2023hyperreenact], DiffusionAct produces realistic, artifact-free images, accurately transfers the target head pose and expression, and faithfully reconstructs the source identity and appearance across challenging conditions, e.g., large head pose movements.
  • Figure 2: Overview of the proposed method: Given a source image ($\mathbf{x}_0^s$) and a target image ($\mathbf{x}_0^t$), we condition the reenactment encoder $\mathcal{E}_r$ on the target facial landmarks $\mathbf{y}_t$ to predict the reenactment semantic code $\mathbf{z}_r$. When decoded by the DDIM, this code generates the reenacted image $\mathbf{x}_0^r$, which captures the source identity/appearance and the target head pose and facial expressions.
  • Figure 3: Qualitative comparisons on self (top 4 rows) and cross-subject (bottom 4 rows) reenactment on the VoxCeleb1 [Nagrani17] dataset.
  • Figure 4: Comparisons against HyperReenact [bounareli2023hyperreenact] in both self (left figure) and cross-subject (right figure) reenactment. As clearly shown, our method achieves more accurate and realistic image reconstructions compared to HyperReenact.
  • Figure 5: Qualitative comparisons of different training choices of our framework, i.e., without pre-training the reenactment encoder ("w/o pre-training"), without the training protocol that involves both self reenactment and image reconstruction tasks ("w/o batch split") and without fine-tuning the DDIM decoder ("w/o fine-tuning").
  • ...and 14 more figures