HiFiVFS: High Fidelity Video Face Swapping

Xu Chen, Keke He, Junwei Zhu, Yanhao Ge, Wei Li, Chengjie Wang

TL;DR

HiFiVFS addresses the challenge of high-fidelity, temporally stable video face swapping by extending Stable Video Diffusion with a multi-frame, diffusion-based pipeline. It introduces Fine-grained Attributes Learning (FAL) to disentangle and preserve detailed attributes (e.g., lighting, makeup) and Detailed Identity Learning (DIL) to enrich identity representation with tokens for robust cross-frame attention, all guided by temporal identity injection. Leveraging temporal attention and a temporal diffusion framework, HiFiVFS achieves state-of-the-art performance on FF++ and VFHQ-FS, excelling in identity preservation, attribute detail, and video stability. The work highlights practical impact for media production and privacy, while also acknowledging potential misuse and diffusion-sampling limitations that motivate future efficiency improvements.

Abstract

Face swapping aims to generate results that combine the identity from the source with attributes from the target. Existing methods primarily focus on image-based face swapping. When processing videos, each frame is handled independently, making it difficult to ensure temporal stability. From a model perspective, face swapping is gradually shifting from generative adversarial networks (GANs) to diffusion models (DMs), as DMs have been shown to possess stronger generative capabilities. Current diffusion-based approaches often employ inpainting techniques, which struggle to preserve fine-grained attributes like lighting and makeup. To address these challenges, we propose a high fidelity video face swapping (HiFiVFS) framework, which leverages the strong generative capability and temporal prior of Stable Video Diffusion (SVD). We build a fine-grained attribute module to extract identity-disentangled and fine-grained attribute features through identity desensitization and adversarial learning. Additionally, we introduce detailed identity injection to further enhance identity similarity. Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) performance in video face swapping, both qualitatively and quantitatively.

Paper Structure

This paper contains 18 sections, 8 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Face swapping results of HiFiVFS. The face in the source image (orange) is taken to replace the face in the target video (blue).
  • Figure 2: Training pipeline of face swapping methods. (a) GAN-based methods achieve feature disentanglement by using attribute and identity loss along with adversarial learning. (b) Diffusion-based methods construct an inpainting data flow that leverages pre-trained identity and attribute features to fill in facial areas. (c) Our HiFiVFS is designed for video face swapping by incorporating temporal attention on multiple frames and introducing temporal identity injection. We also introduce a fine-grained attribute extractor and a detailed identity tokenizer to improve control over attributes and identities.
  • Figure 3: Pipeline of our proposed HiFiVFS, including training and inference phases. HiFiVFS is primarily trained based on the SVD blattmann2023svd framework, utilizing multi-frame input and temporal attention to ensure the stability of the generated videos. In the training phase, HiFiVFS introduces fine-grained attribute learning (FAL) and detailed identity learning (DIL). In FAL, attribute disentanglement and enhancement are achieved through identity desensitization and adversarial learning. DIL uses ID features better suited to face swapping to further boost identity similarity. In the inference phase, FAL retains only $E_{att}$ for attribute extraction, making the testing process more convenient. Note that HiFiVFS is trained and tested in the latent space rombach2022high, but for visualization purposes, we illustrate all processes in the original image space.
  • Figure 4: VFHQ-FS results compared with other methods. The source image of each example is placed in the corresponding top-left position, and the target videos are in the first row. The complete video comparisons are included in the Supplementary Materials.
  • Figure 5: FF++ results compared with FaceShifter li2019faceshifter, BlendFace shiohara2023blendface and Face-Adapter han2024face.
  • ...and 4 more figures