DiffSwap++: 3D Latent-Controlled Diffusion for Identity-Preserving Face Swapping
Weston Bondurant, Arkaprava Sinha, Hieu Le, Srijan Das, Stephanie Schuckers
TL;DR
DiffSwap++ tackles identity-preserving face swapping by introducing 3D facial latent features as training-time conditioning for a diffusion-based inpainting pipeline, enabling stronger disentanglement of identity from pose and expression. The method fuses 3D latent cues with 2D identity embeddings and facial landmarks through cross-attention in a latent diffusion model, and it is trained with a two-stage objective that combines reconstruction, identity, and perceptual losses. Empirical results on CelebA-HQ, FFHQ, and CelebV-Text demonstrate state-of-the-art identity retention with competitive pose and expression fidelity, supported by a biometric-style evaluation (IAPAR/FRR) and a user study. The work highlights 3D-aware supervision as a key factor for robust, photorealistic face swaps and suggests future work on 3D-aware detection of manipulated faces.
Abstract
Diffusion-based approaches have recently achieved strong results in face swapping, offering improved visual quality over traditional GAN-based methods. However, even state-of-the-art models often suffer from fine-grained artifacts and poor identity preservation, particularly under challenging poses and expressions. A key limitation of existing approaches is their failure to meaningfully leverage 3D facial structure, which is crucial for disentangling identity from pose and expression. In this work, we propose DiffSwap++, a novel diffusion-based face-swapping pipeline that incorporates 3D facial latent features during training. By guiding the generation process with 3D-aware representations, our method enhances geometric consistency and improves the disentanglement of facial identity from appearance attributes. We further design a diffusion architecture that conditions the denoising process on both identity embeddings and facial landmarks, enabling high-fidelity and identity-preserving face swaps. Extensive experiments on CelebA, FFHQ, and CelebV-Text demonstrate that DiffSwap++ outperforms prior methods in preserving source identity while maintaining target pose and expression. Additionally, we introduce a biometric-style evaluation and conduct a user study to further validate the realism and effectiveness of our approach. Code will be made publicly available at https://github.com/WestonBond/DiffSwapPP
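To make the conditioning idea concrete, the following is a minimal sketch of single-head cross-attention in which image latent tokens attend over a concatenated sequence of identity, landmark, and 3D latent tokens. It is an illustration under stated assumptions, not the DiffSwap++ implementation: the token counts, the model width d_model, the random projection matrices, and the choice to concatenate all conditions into one sequence are hypothetical, and the helper names (matmul, cross_attention, random_matrix) are introduced here only for the example.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a single list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(latents, cond, w_q, w_k, w_v):
    """Queries come from image latents; keys and values come from conditions."""
    q = matmul(latents, w_q)                     # (N, d)
    k = matmul(cond, w_k)                        # (M, d)
    v = matmul(cond, w_v)                        # (M, d)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]         # (d, M)
    scores = matmul(q, k_t)                      # (N, M)
    attn = [softmax([s * scale for s in row]) for row in scores]
    return matmul(attn, v), attn                 # (N, d), (N, M)


def random_matrix(rng, rows, cols, scale=1.0):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model = 8                                      # hypothetical model width

# Hypothetical conditioning tokens, all assumed already projected to d_model.
id_tokens = random_matrix(rng, 1, d_model)        # 2D identity embedding
landmark_tokens = random_matrix(rng, 5, d_model)  # facial landmark tokens
latent3d_tokens = random_matrix(rng, 2, d_model)  # 3D facial latent tokens
cond = id_tokens + landmark_tokens + latent3d_tokens  # concatenate along sequence

latents = random_matrix(rng, 4, d_model)         # noisy image latent tokens
w_q, w_k, w_v = (random_matrix(rng, d_model, d_model, 1.0 / math.sqrt(d_model))
                 for _ in range(3))

out, attn = cross_attention(latents, cond, w_q, w_k, w_v)
```

Each row of attn is a distribution over the eight conditioning tokens, so every latent token receives a weighted mixture of identity, landmark, and 3D cues. A real latent diffusion model would use learned multi-head projections inside the denoising network rather than the random single-head weights assumed here.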
