VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping

Hao Shao, Shulun Wang, Yang Zhou, Guanglu Song, Dailan He, Shuo Qin, Zhuofan Zong, Bingqi Ma, Yu Liu, Hongsheng Li

TL;DR

VividFace introduces the first diffusion-based framework for video face swapping, addressing temporal coherence and large-pose challenges by leveraging an image-video hybrid training strategy. A VidFaceVAE unifies image and video processing in a shared latent space, while 3DMM conditioning, occlusion augmentation, and a novel AIDT dataset promote robust identity-attribute disentanglement. The approach demonstrates superior Fréchet Video Distance, temporal stability, and identity preservation with fewer inference steps than prior methods. This work advances practical, high-fidelity video face swapping and offers a foundation for robust editing in dynamic scenes.

Abstract

Video face swapping is becoming increasingly popular across various applications, yet existing methods primarily focus on static images and struggle with video, where they must maintain temporal consistency and handle complex scenarios. In this paper, we present the first diffusion-based framework specifically designed for video face swapping. Our approach introduces a novel image-video hybrid training framework that leverages both abundant static image data and temporal video sequences, addressing the inherent limitations of video-only training. The framework incorporates a specially designed diffusion model coupled with a VidFaceVAE that effectively processes both types of data to better maintain temporal coherence of the generated videos. To further disentangle identity and pose features, we construct the Attribute-Identity Disentanglement Triplet (AIDT) Dataset, where each triplet has three face images, with two images sharing the same pose and two sharing the same identity. Enhanced with a comprehensive occlusion augmentation, this dataset also improves robustness against occlusions. Additionally, we integrate 3D reconstruction techniques as input conditioning to our network for handling large pose variations. Extensive experiments demonstrate that our framework achieves superior performance in identity preservation, temporal consistency, and visual quality compared to existing methods, while requiring fewer inference steps. Our approach effectively mitigates key challenges in video face swapping, including temporal flickering, loss of identity, and sensitivity to occlusions and pose variations.
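
The triplet constraint described in the abstract can be illustrated with a small sketch. It is not the authors' data pipeline: the `Face` record, its string-valued `identity` and `pose` labels, and the `is_valid_triplet` helper are all hypothetical names introduced here, and a real dataset would presumably compare continuous pose parameters rather than discrete labels. The check also assumes that exactly one pair shares identity and exactly one (different) pair shares pose.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Face:
    """Hypothetical record for one face image in a triplet."""
    image_id: str
    identity: str  # assumed discrete identity label
    pose: str      # assumed discrete pose label (stand-in for real pose parameters)


def is_valid_triplet(faces):
    """Return True if one pair shares identity and a different pair shares pose."""
    if len(faces) != 3:
        return False
    pairs = list(combinations(range(3), 2))
    same_identity = [p for p in pairs if faces[p[0]].identity == faces[p[1]].identity]
    same_pose = [p for p in pairs if faces[p[0]].pose == faces[p[1]].pose]
    # Exactly one identity-sharing pair and one pose-sharing pair, and they differ.
    return (
        len(same_identity) == 1
        and len(same_pose) == 1
        and same_identity[0] != same_pose[0]
    )


valid = [
    Face("a", "alice", "left"),
    Face("b", "alice", "front"),  # same identity as "a", different pose
    Face("c", "bob", "left"),     # same pose as "a", different identity
]
invalid = [
    Face("a", "alice", "left"),
    Face("b", "alice", "front"),
    Face("c", "alice", "up"),     # all three share identity, no shared pose
]
```

Under these assumptions, one image sits in both pairs, so the other two images each differ from it in only one factor (identity or pose), which is the kind of contrast a disentanglement objective can exploit.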

Paper Structure

This paper contains 16 sections, 1 equation, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Face swapping results of VividFace at $512 \times 512$ resolution. Our method produces high-fidelity and vivid outputs that accurately follow both pose and expression changes.
  • Figure 2: Overview of the proposed framework. During training, our framework randomly chooses static images or video sequences as the training data. In addition to the noise $z_t$, three other types of inputs are integrated to guide the generation process: (1) a face region mask, which controls the generation of facial imagery; (2) a 3D reconstructed face, which helps guide the pose and expression, especially in cases of large pose variations; and (3) masked source images, which supply background information. These inputs are processed through the Backbone Network, which performs the denoising operation. Within the Backbone Network, we employ cross-attention and temporal attention mechanisms. The temporal attention module ensures temporal continuity and consistency across frames. Our face encoder extracts identity and texture features from the target face, as well as pose and expression details from the source face, and uses these features in cross-attention to produce realistic and high-fidelity results.
  • Figure 3: Overview of the proposed VidFaceVAE, capable of simultaneous encoding and decoding of both image and video data. Certain modules are specifically designed for video inputs, and image inputs bypass these modules as needed.
  • Figure 4: Visualization of our occlusion data augmentation, which improves the stability and consistency of the generated videos.
  • Figure 5: Visualization of our AIDT dataset. For video facial data, we present only the target and decoupling faces, as the source faces can be derived from any other frame within the same video clip.
  • ...and 3 more figures
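
The captions of Figures 2 and 3 describe two mechanisms: training randomly draws either static images or video sequences, and image inputs bypass the modules designed for video. A toy sketch of that control flow follows. It makes several assumptions not stated in the source: the sampling probability `video_prob`, the treatment of an image as a one-frame clip, and the neighbour-averaging `temporal_smooth` function, which is only a stand-in for a temporal module and not the paper's temporal attention. Frames are reduced to single floats to keep the example self-contained.

```python
import random


def sample_modality(rng, video_prob=0.5):
    """Pick 'video' with probability video_prob, else 'image' (video_prob is assumed)."""
    return "video" if rng.random() < video_prob else "image"


def temporal_smooth(frames):
    """Stand-in temporal module: average each frame with its immediate neighbours."""
    smoothed = []
    for i in range(len(frames)):
        window = frames[max(0, i - 1): i + 2]
        smoothed.append(sum(window) / len(window))
    return smoothed


def forward(frames):
    """Apply the temporal module to clips; single-frame (image) inputs bypass it."""
    if len(frames) == 1:
        return list(frames)  # image input: skip the video-only module
    return temporal_smooth(frames)


rng = random.Random(0)
modality = sample_modality(rng)      # 'image' or 'video'
image_out = forward([2.0])           # bypasses temporal_smooth
video_out = forward([0.0, 3.0, 6.0]) # goes through temporal_smooth
```

The point of the sketch is only that one code path can serve both data types, so image data can contribute to training without a separate model.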