LDFaceNet: Latent Diffusion-based Network for High-Fidelity Deepfake Generation

Dwij Mehta; Aditya Mehta; Pratik Narang

TL;DR

LDFaceNet addresses high-fidelity face swapping with a guided latent diffusion model conditioned on source identity and target segmentation. It introduces a dual-guidance framework with an identity loss $G_{id}$ and a segmentation loss $G_{seg}$, combined as $G_{fac} = \lambda_{id}(t)\, G_{id} + \lambda_{seg}(t)\, G_{seg}$, and applies latent-level background blending to maintain scene coherence. The approach requires no retraining because it relies entirely on pre-trained components: a Latent Diffusion Model, an ArcFace-based identity embedding, and a BiSeNet segmentation network. Ablation studies indicate that both identity and segmentation guidance are necessary, and results on CelebA show improved identity preservation and expression stability compared with state-of-the-art methods, including robust handling of occlusions. Overall, LDFaceNet demonstrates the potential of guided diffusion for controllable, high-quality face swaps and lays groundwork for further diffusion-based work in this domain.
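As a rough illustration of the combined guidance term in the TL;DR, the sketch below weights an identity loss and a segmentation loss with timestep-dependent coefficients. The summary does not state the form of $\lambda_{id}(t)$ or $\lambda_{seg}(t)$, so the linear schedules, their endpoint values, and the function names here are all assumptions made for illustration, not the authors' settings.

```python
def linear_schedule(t, num_steps, start, end):
    """Hypothetical weight schedule (an assumption, not from the paper).

    Returns `start` at the noisiest step (t = num_steps - 1) and
    `end` at the final denoising step (t = 0), interpolating linearly.
    """
    frac = t / (num_steps - 1)
    return end + (start - end) * frac


def facial_guidance(g_id, g_seg, t, num_steps=1000):
    """Combine the two losses as G_fac = lambda_id(t) * G_id + lambda_seg(t) * G_seg."""
    # Assumed behaviour: segmentation dominates early (coarse layout),
    # identity dominates late (fine facial detail).
    lam_id = linear_schedule(t, num_steps, start=0.2, end=1.0)
    lam_seg = linear_schedule(t, num_steps, start=1.0, end=0.2)
    return lam_id * g_id + lam_seg * g_seg


# Example: the same pair of loss values is weighted differently over time.
early = facial_guidance(g_id=0.5, g_seg=0.5, t=999)
late = facial_guidance(g_id=0.5, g_seg=0.5, t=0)
```

In the actual method, the gradient of this combined term with respect to the noisy latent steers the reverse diffusion process; here scalar values stand in for the losses so the weighting can be read in isolation.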

Abstract

Over the past decade, there has been tremendous progress in synthetic media generation, driven mainly by powerful methods based on generative adversarial networks (GANs). More recently, diffusion probabilistic models, which are inspired by non-equilibrium thermodynamics, have taken the spotlight. In image generation, diffusion models (DMs) have proven highly capable of producing both realistic and diverse imagery through their stochastic sampling procedure. This paper proposes a novel face-swapping module, termed LDFaceNet (Latent Diffusion based Face Swapping Network), which is based on a guided latent diffusion model that uses facial segmentation and facial recognition modules to condition the denoising process. The model employs a unique loss function to provide directional guidance to the diffusion process. Notably, LDFaceNet can incorporate supplementary facial guidance for desired outcomes without any retraining. To the best of our knowledge, this is the first application of a latent diffusion model to the face-swapping task without prior training. The results of this study demonstrate that the proposed method can generate highly realistic and coherent images by leveraging the potential of the diffusion model for face swapping, yielding superior visual outcomes and greater diversity.

Paper Structure (13 sections, 6 equations, 6 figures, 1 table, 1 algorithm)

Introduction
Related Work
Models for Image Synthesis
Face Swapping Models
Preliminary: Diffusion Models
Methodology
Source Identity Guided Diffusion
Target Segmentation Guided Diffusion
Background Preservation
Results and Discussion
Quantitative and Qualitative results
Ablation Study
Conclusion

Figures (6)

Figure 1: Sample output of LDFaceNet. Compared to recent state-of-the-art methods such as E4S (CVPR'23) [liu2023e4s], the results produced by LDFaceNet are significantly better. This particular example also illustrates that our method handles occlusions over the target face much better than other generative methods. Further details are in the Results and Discussion section.
Figure 2: Proposed sampling process. The sampling process begins by encoding the target image into a latent vector using an encoder. The encoder and decoder used in this method come from the same autoencoder, based on previous work by Esser et al. [esser2021taming]. Noise is added to this latent vector according to the diffusion noise schedule. Subsequently, a pre-trained U-Net is used to denoise this latent vector. The output of the U-Net is then conditioned using our novel facial guidance module. A downsampled facial mask ensures that the masked area acquires the necessary facial characteristics through facial guidance while the background remains constant. Finally, after completing the denoising process, we pass the final latent vector $z_0$ into a decoder to get the swapped image. This entire process is detailed in Algorithm 1.
Figure 3: Facial Guidance Module. The latent vector $pred_{z_0}$ is estimated from the output ($\epsilon_t$) of the denoising U-Net. $pred_{z_0}$ is upsampled to get $\widehat{x}$, which approximates what the swapped image would look like after the entire denoising process. $\widehat{x}$ is then used within the identity-guided and segmentation-guided modules, since these rely on pre-trained classifiers that were trained on ordinary images rather than latent vectors. The embeddings of $\widehat{x}$, $x_{src}$, and $x_{targ}$ are then used as given in Algorithm 1 to compute the identity and segmentation guidance losses. These are combined to form the complete facial guidance loss term. The gradient of this facial loss term with respect to $z_t$ is used to guide the reverse diffusion process.
Figure 4: Qualitative Results. Our method achieves high-fidelity results, better preserving source identity and target facial attributes than other methods. It also handles occlusions and partial views robustly.
Figure 5: Comparative performance of ablation experiments. The x-axis represents the three variants of LDFaceNet. The three lines show the performance of each variant on three metrics.
...and 1 more figure
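Two steps described in the captions for Figures 2 and 3 can be sketched in a few lines: estimating the clean latent $pred_{z_0}$ from the U-Net noise prediction, and blending latents with the downsampled facial mask so the background stays fixed. The captions do not give the formulas, so this sketch assumes the standard DDPM relation $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ and a simple per-element mask blend. Latents are represented as flat Python lists for brevity, and all function names are hypothetical.

```python
import math


def add_noise(z0, eps, alpha_bar_t):
    """Forward diffusion under the assumed DDPM relation."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * e for z, e in zip(z0, eps)]


def predict_z0(z_t, eps_pred, alpha_bar_t):
    """Invert the forward relation to estimate the clean latent from predicted noise."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [(z - b * e) / a for z, e in zip(z_t, eps_pred)]


def blend_background(z_face, z_background, mask):
    """Keep guided values where mask is 1 and background values where mask is 0."""
    return [m * f + (1.0 - m) * g for f, g, m in zip(z_face, z_background, mask)]


# Toy example with a 4-element "latent" (values chosen arbitrarily).
z0_true = [1.0, -2.0, 0.5, 3.0]
eps = [0.1, 0.2, -0.3, 0.4]
z_t = add_noise(z0_true, eps, alpha_bar_t=0.64)

# With a perfect noise prediction, the estimate recovers the clean latent.
z0_hat = predict_z0(z_t, eps, alpha_bar_t=0.64)

# The first two elements belong to the face region, the last two to the background.
blended = blend_background(
    z_face=[9.0, 9.0, 9.0, 9.0],
    z_background=[1.0, 2.0, 3.0, 4.0],
    mask=[1.0, 1.0, 0.0, 0.0],
)
```

In a real pipeline the latents would be multi-channel tensors, the noise prediction would come from the pre-trained U-Net, and the background latent would be the target latent noised to the current timestep; those details are left out to keep the two operations visible.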
