DiffusionAct: Controllable Diffusion Autoencoder for One-shot Face Reenactment
Stella Bounareli, Christos Tzelepis, Vasileios Argyriou, Ioannis Patras, Georgios Tzimiropoulos
TL;DR
DiffusionAct tackles one-shot face reenactment by pairing a pre-trained diffusion autoencoder with a learnable reenactment encoder conditioned on target facial landmarks and gaze. The reenactment encoder predicts a semantic code $\mathbf{z}_r$ that guides a DDIM sampler to render an image $\mathbf{x}_0^r$ that preserves the source identity while adopting the target head pose and expression. Training proceeds in two stages: a self-reenactment pre-training stage that aligns $\mathbf{z}_r$ with $\mathbf{z}_t$, followed by a main stage with reconstruction and pose losses. The losses include $\mathcal{L}_{pix}$, $\mathcal{L}_{per}$, $\mathcal{L}_{id}$, $\mathcal{L}_{bg}$, $\mathcal{L}_{st}$, and the pose terms $\mathcal{L}_g$, $\mathcal{L}_{sh}$, $\mathcal{L}_{hp}$. Evaluations on VoxCeleb1/2 show superior image fidelity and robust pose transfer, often outperforming GAN-, StyleGAN2-, and diffusion-based baselines, while preserving source appearance details such as hair and glasses. The approach uses 3DMM-based landmarks obtained via EMOCA to avoid identity leakage, and it remains competitive under cross-subject conditions. The noted limitation is slower inference, owing to diffusion sampling. Overall, DiffusionAct provides a practical, one-shot, controllable path to realistic face reenactment using pre-trained diffusion models without subject-specific fine-tuning.
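To make the role of the semantic code concrete, the following is a minimal sketch of deterministic DDIM sampling (the standard $\eta = 0$ update) in which the noise predictor is conditioned on $\mathbf{z}_r$. It is not the authors' implementation: the names `ddim_step`, `ddim_sample`, and `toy_eps_model`, the linear beta schedule, and the `tanh` "decoder" are all assumptions chosen so that the example runs with NumPy alone. The toy predictor simply returns the exact noise for an image determined by `z_r`, standing in for the conditional denoising network of a diffusion autoencoder.

```python
import numpy as np

def ddim_step(x_t, eps, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0) from noise level abar_t to abar_prev."""
    # Estimate the clean image implied by the predicted noise.
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps) / np.sqrt(abar_t)
    # Re-noise the estimate to the previous (less noisy) level.
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps

def toy_eps_model(x_t, abar_t, z_r):
    """Hypothetical stand-in for a noise predictor conditioned on z_r.

    Assumption: the semantic code fully determines the clean image as tanh(z_r),
    so the exact noise can be computed in closed form.
    """
    x0 = np.tanh(z_r)
    return (x_t - np.sqrt(abar_t) * x0) / np.sqrt(1.0 - abar_t)

def ddim_sample(x_T, z_r, eps_model, abar, steps):
    """Run `steps` DDIM updates from x_T, conditioning every step on z_r."""
    ts = np.linspace(len(abar) - 1, 0, steps).round().astype(int)
    x = x_T
    for i, t in enumerate(ts):
        # After the last step the sample is fully denoised (abar = 1).
        abar_prev = abar[ts[i + 1]] if i + 1 < len(ts) else 1.0
        eps = eps_model(x, abar[t], z_r)
        x = ddim_step(x, eps, abar[t], abar_prev)
    return x

# Assumed linear beta schedule with 1000 training steps.
betas = np.linspace(1e-4, 0.02, 1000)
abar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x_T = rng.standard_normal(16)    # shared starting noise
z_a = rng.standard_normal(16)    # one semantic code
z_b = rng.standard_normal(16)    # a different semantic code

x_a = ddim_sample(x_T, z_a, toy_eps_model, abar, steps=20)
x_b = ddim_sample(x_T, z_b, toy_eps_model, abar, steps=20)
```

With the same starting noise `x_T`, changing only the semantic code changes the rendered output, which is the control mechanism the summary describes. In the real system the code would come from the reenactment encoder and the predictor would be a trained network.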
Abstract
Video-driven neural face reenactment aims to synthesize realistic facial images that preserve the identity and appearance of a source face while transferring the head pose and facial expressions of a target. Existing GAN-based methods suffer from either distortions and visual artifacts or poor reconstruction quality, i.e., the background and several important appearance details, such as hair style/color, glasses, and accessories, are not faithfully reconstructed. Recent advances in Diffusion Probabilistic Models (DPMs) enable the generation of high-quality realistic images. To this end, we present DiffusionAct, a novel method that leverages the photo-realistic image generation of diffusion models to perform neural face reenactment. Specifically, we propose to control the semantic space of a Diffusion Autoencoder (DiffAE) in order to edit the facial pose of the input images, defined as the head pose orientation and the facial expressions. Our method allows one-shot, self, and cross-subject reenactment without requiring subject-specific fine-tuning. We compare against state-of-the-art GAN-, StyleGAN2-, and diffusion-based methods, showing better or on-par reenactment performance.
