High-Fidelity Diffusion Face Swapping with ID-Constrained Facial Conditioning

Dailan He, Xiahong Wang, Shulun Wang, Guanglu Song, Bingqi Ma, Hao Shao, Yu Liu, Hongsheng Li

TL;DR

This work tackles diffusion-based face swapping, where preserving the source identity competes with matching the target attributes. It introduces an identity-constrained diffusion framework with decoupled identity and attribute conditioning, implemented via a three-stage training pipeline and a post-training refinement that incorporates identity and adversarial losses, all built atop a latent diffusion model. The method uses ArcFace and DINOv2 for identity cues, SimSwap for attributes, and GLIGEN adapters for condition injection, achieving state-of-the-art identity similarity and strong attribute fidelity on FFHQ, and it generalizes to stylized images. Quantitative and human evaluations show improvements in Fréchet Inception Distance (FID), identity retrieval, and perceived fidelity, supporting the effectiveness of decoupled conditioning and staged optimization. Beyond face swapping, the approach offers a strategy that may carry over to other multi-condition diffusion tasks in conditional image generation.

Abstract

Face swapping aims to seamlessly transfer a source facial identity onto a target while preserving target attributes such as pose and expression. Diffusion models, known for their superior generative capabilities, have recently shown promise in advancing face-swapping quality. This paper addresses two key challenges in diffusion-based face swapping: the prioritized preservation of identity over target attributes and the inherent conflict between identity and attribute conditioning. To tackle these issues, we introduce an identity-constrained attribute-tuning framework for face swapping that first ensures identity preservation and then fine-tunes for attribute alignment, achieved through a decoupled condition injection. We further enhance fidelity by incorporating identity and adversarial losses in a post-training refinement stage. Our proposed identity-constrained diffusion-based face-swapping model outperforms existing methods in both qualitative and quantitative evaluations, demonstrating superior identity similarity and attribute consistency, achieving a new state-of-the-art performance in high-fidelity face swapping.
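The abstract does not state the exact form of the identity loss used in the post-training refinement stage. A common choice in face-swapping work is one minus the cosine similarity between face-recognition embeddings (for example, from ArcFace) of the source face and the swapped result. The sketch below assumes that form; the function names are hypothetical, and plain Python lists stand in for the embedding vectors a recognition network would produce.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def identity_loss(source_emb: Sequence[float], swapped_emb: Sequence[float]) -> float:
    """Assumed identity loss: 1 - cos(source, swapped).

    0 means the swapped face matches the source identity direction exactly;
    larger values mean the identities have drifted apart.
    """
    return 1.0 - cosine_similarity(source_emb, swapped_emb)


# Toy 2-D embeddings (hypothetical values, not real ArcFace outputs).
same_identity = identity_loss([1.0, 0.0], [2.0, 0.0])
unrelated_identity = identity_loss([1.0, 0.0], [0.0, 1.0])
```

With this form, the loss depends only on the direction of the embeddings, not their magnitude, which is why the two collinear toy vectors give a loss of zero.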

Paper Structure

This paper contains 15 sections, 5 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Face-swapping examples of the proposed approach. Given a source face (white border) and a target face (left), face swapping aims to transfer the identity of the source image onto the target face while keeping the target's attributes (expression, pose, etc.).
  • Figure 2: The identity-constrained attribute tuning scheme. We aim to obtain the red-dot sample with constrained identity and attribute. We propose to divide this process into 3 stages.
  • Figure 3: Identity and attribute compete against each other. The former pushes the model output closer to the source face, while the latter aligns the model to the target face.
  • Figure 4: Various designs for condition injection. Visual embeddings are injected into the middle attention layers of the denoising UNet in LDMs, with cross-attention adapters (ye2023ipadapter; li2023gligen).
  • Figure 5: Diagram of the proposed multi-stage training scheme with identity-constrained facial conditioning.
  • ...and 4 more figures