DiffSwap++: 3D Latent-Controlled Diffusion for Identity-Preserving Face Swapping

Weston Bondurant, Arkaprava Sinha, Hieu Le, Srijan Das, Stephanie Schuckers

TL;DR

DiffSwap++ tackles identity-preserving face swapping by introducing 3D facial latent features as training-time conditioning for a diffusion-based inpainting pipeline, enabling stronger disentanglement of identity from pose and expression. The method fuses 3D latent cues with 2D identity embeddings and landmarks via cross-attention in a latent diffusion model, and optimizes a two-stage training objective combining reconstruction, identity, and perceptual losses. Empirical results on CelebA-HQ, FFHQ, and CelebV-Text demonstrate state-of-the-art identity retention with competitive pose/expression fidelity, supported by a biometric-style evaluation (IAPAR/FRR) and a user study. The work highlights 3D-aware supervision as a key factor for robust, photorealistic face swaps and suggests future work on 3D-aware detection of manipulated faces.

Abstract

Diffusion-based approaches have recently achieved strong results in face swapping, offering improved visual quality over traditional GAN-based methods. However, even state-of-the-art models often suffer from fine-grained artifacts and poor identity preservation, particularly under challenging poses and expressions. A key limitation of existing approaches is their failure to meaningfully leverage 3D facial structure, which is crucial for disentangling identity from pose and expression. In this work, we propose DiffSwap++, a novel diffusion-based face-swapping pipeline that incorporates 3D facial latent features during training. By guiding the generation process with 3D-aware representations, our method enhances geometric consistency and improves the disentanglement of facial identity from appearance attributes. We further design a diffusion architecture that conditions the denoising process on both identity embeddings and facial landmarks, enabling high-fidelity and identity-preserving face swaps. Extensive experiments on CelebA, FFHQ, and CelebV-Text demonstrate that DiffSwap++ outperforms prior methods in preserving source identity while maintaining target pose and expression. Additionally, we introduce a biometric-style evaluation and conduct a user study to further validate the realism and effectiveness of our approach. Code will be made publicly available at https://github.com/WestonBond/DiffSwapPP
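
To make the conditioning idea concrete, the following is a minimal sketch of how identity, landmark, and 3D features could be projected to a shared width and attended to by image-latent queries. It is not the authors' implementation: the feature sizes, the random projection weights, the function names, and the use of the condition tokens as both keys and values (with no learned key/value projections) are all assumptions made for illustration.

```python
import math
import random

def linear(x, w):
    """Project vector x (length d_in) with weight matrix w (d_in x d_out)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def rand_matrix(rows, cols, rng):
    """Random stand-in for a learned projection (hypothetical weights)."""
    scale = 1.0 / math.sqrt(rows)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def build_condition_tokens(id_emb, landmarks, feat_3d, projections):
    """Map each conditioning signal to the shared width and stack as tokens."""
    return [
        linear(id_emb, projections["identity"]),
        linear(landmarks, projections["landmarks"]),
        linear(feat_3d, projections["3d"]),
    ]

def cross_attention(queries, cond_tokens):
    """Single-head attention: each query attends over the condition tokens.

    Simplification: the tokens serve as both keys and values.
    """
    d = len(cond_tokens[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in cond_tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append([
            sum(w[t] * cond_tokens[t][j] for t in range(len(cond_tokens)))
            for j in range(d)
        ])
    return outputs, weights

rng = random.Random(0)
D = 4  # shared token width (assumed)
projections = {
    "identity": rand_matrix(8, D, rng),    # 8-dim identity embedding (assumed size)
    "landmarks": rand_matrix(6, D, rng),   # 3 landmarks as flattened (x, y) pairs (assumed)
    "3d": rand_matrix(10, D, rng),         # 10-dim 3D latent feature (assumed size)
}
id_emb = [rng.gauss(0, 1) for _ in range(8)]
landmarks = [rng.random() for _ in range(6)]
feat_3d = [rng.gauss(0, 1) for _ in range(10)]

cond = build_condition_tokens(id_emb, landmarks, feat_3d, projections)
queries = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(3)]  # stand-in image latents
attended, attn_weights = cross_attention(queries, cond)
```

Each row of `attn_weights` shows how strongly one latent position draws on the identity, landmark, and 3D tokens. In a real denoising network the queries would come from intermediate U-Net features rather than random vectors.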

Paper Structure

This paper contains 15 sections, 12 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Comparison between DiffSwap++ and previous diffusion methods showcasing the lack of artifacts and deformations in our swapped outputs.
  • Figure 2: The 3D features are extracted from an encoder trained for 3D face reconstruction and are projected into a compatible feature dimension to be used as a conditioning feature in DiffSwap++.
  • Figure 3: Overview of the condition generation pipeline for DiffSwap++. We utilize the standard landmark and identity features alongside our 3D features, novel to diffusion face-swapping.
  • Figure 4: Training pipeline for DiffSwap++. The left portion highlights our primary DDIM training pipeline where we perform both reconstruction and face-swapping. The right portion highlights our feature generation where we combine our conditioning features to prepare the input of our diffusion model.
  • Figure 5: A comparison of IAPAR for all models at different thresholds against FRR.
  • ...and 3 more figures
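
The biometric-style comparison in Figure 5 can be illustrated with a small sketch. It assumes a reading of the two metrics that this page does not spell out: FRR as the fraction of genuine (same-person) comparisons rejected at a threshold, and IAPAR as the fraction of swapped-face comparisons accepted as the source identity at that threshold. The similarity scores below are made-up illustration values, not results from the paper.

```python
def frr(genuine_scores, threshold):
    """False rejection rate: share of genuine comparisons scoring below the threshold."""
    rejected = sum(1 for s in genuine_scores if s < threshold)
    return rejected / len(genuine_scores)

def iapar(attack_scores, threshold):
    """Share of swapped-face comparisons accepted (score at or above the threshold).

    Assumed reading of IAPAR; the paper's exact protocol may differ.
    """
    accepted = sum(1 for s in attack_scores if s >= threshold)
    return accepted / len(attack_scores)

# Hypothetical similarity scores from a face matcher (illustration only).
genuine = [0.82, 0.75, 0.91, 0.64, 0.70]   # real source image vs. real source image
swapped = [0.58, 0.72, 0.49, 0.66, 0.35]   # swapped output vs. real source image

# Sweep thresholds to trace an IAPAR-versus-FRR curve.
curve = [(t, frr(genuine, t), iapar(swapped, t)) for t in (0.4, 0.6, 0.8)]
for t, f, a in curve:
    print(f"threshold={t:.1f}  FRR={f:.2f}  IAPAR={a:.2f}")
```

Under this reading, a swapping method that preserves source identity better yields a higher IAPAR at the same FRR, since more of its outputs are matched to the source.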