Realistic and Efficient Face Swapping: A Unified Approach with Diffusion Models

Sanoojan Baliah, Qinliang Lin, Shengcai Liao, Xiaodan Liang, Muhammad Haris Khan

TL;DR

This work reframes the face-swapping task as a self-supervised, train-time inpainting problem, enhancing identity transfer while blending with the target image, and introduces a mask shuffling technique during inpainting training, which allows for a so-called universal model for swapping.

Abstract

Despite promising progress in face swapping, realistic swapped images remain elusive and are often marred by artifacts, particularly in scenarios involving high pose variation, color differences, and occlusion. To address these issues, we propose a novel approach that better harnesses diffusion models for face swapping through the following core contributions. (a) We reframe the face-swapping task as a self-supervised, train-time inpainting problem, enhancing identity transfer while blending with the target image. (b) We introduce multi-step Denoising Diffusion Implicit Model (DDIM) sampling during training, reinforcing identity and perceptual similarities. (c) We introduce CLIP feature disentanglement to extract pose, expression, and lighting information from the target image, improving fidelity. (d) We introduce a mask shuffling technique during inpainting training, which allows us to create a so-called universal model for swapping, with the additional feature of head swapping. Our method can swap hair and even accessories, going beyond traditional face swapping. Unlike prior works that rely on multiple off-the-shelf models, ours is a relatively unified approach and is therefore resilient to errors in such models. Extensive experiments on the FFHQ and CelebA datasets validate the efficacy and robustness of our approach, showcasing high-fidelity, realistic face swapping with minimal inference time. Our code is available at https://github.com/Sanoojan/REFace.
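
Contribution (b) relies on DDIM sampling, so a minimal sketch of the standard deterministic DDIM update (eta = 0) may help. It is not taken from the REFace code: the names `ddim_step`, `ddim_sample`, `eps_model`, and `alpha_bars` are hypothetical, and the sketch works on flat lists of floats rather than on the latents and conditioned denoiser a real diffusion model would use. Each step estimates the clean sample from the predicted noise and then re-noises it to the next, less noisy level. Running a few such steps yields a cleaner estimate, on which identity and perceptual losses could plausibly be evaluated.

```python
import math


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) on a flat list of floats.

    alpha_bar_t / alpha_bar_prev are the cumulative noise-schedule products
    at the current and the next (less noisy) step.
    """
    sqrt_a_t = math.sqrt(alpha_bar_t)
    sqrt_1m_a_t = math.sqrt(1.0 - alpha_bar_t)
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = [(x - sqrt_1m_a_t * e) / sqrt_a_t for x, e in zip(x_t, eps_pred)]
    sqrt_a_prev = math.sqrt(alpha_bar_prev)
    sqrt_1m_a_prev = math.sqrt(1.0 - alpha_bar_prev)
    # Re-noise the estimate to the next noise level using the same noise.
    x_prev = [sqrt_a_prev * x0 + sqrt_1m_a_prev * e
              for x0, e in zip(x0_pred, eps_pred)]
    return x_prev, x0_pred


def ddim_sample(x_start, eps_model, alpha_bars):
    """Run a short multi-step DDIM trajectory.

    alpha_bars is ordered from the noisiest step to the cleanest one;
    eps_model(x, alpha_bar) is a hypothetical stand-in for the denoiser.
    """
    x, x0 = x_start, x_start
    for a_t, a_prev in zip(alpha_bars[:-1], alpha_bars[1:]):
        eps = eps_model(x, a_t)
        x, x0 = ddim_step(x, eps, a_t, a_prev)
    return x, x0
```

With a perfect noise predictor, the trajectory returns the clean sample exactly, which makes for a convenient sanity check. With a learned predictor, the number of steps trades compute for the quality of the estimate.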

Paper Structure

This paper contains 16 sections, 8 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Face-swapped results of our method. In each group, the swapped face (right) should contain the identity information from the source (bottom left) and the pose, expression, and lighting conditions from the target (top left). All images are at $512 \times 512$ resolution.
  • Figure 2: Our architecture pipeline for face swapping. The training pipeline contains two major components: (a) inpainting diffusion training, where we use the same image as both the reference (with reference augmentation) and the inpainting input (with face shape augmentation); and (b) training with different images, where we enforce an identity loss with the source and perceptual similarity with the target. The bottom left depicts condition generation. All the $z$-s in the training pipeline are latents obtained from forward diffusion; they are shown as images for clarity.
  • Figure 3: Example of reenactment-based cut-and-paste swapping. Zoom in on the boundaries of the result, which show merging issues.
  • Figure 4: Qualitative comparison on the CelebA dataset. Our approach demonstrates robustness in maintaining target environmental conditions (such as lighting) and preserving source ID information. Best viewed in color and zoomed in.
  • Figure 5: Qualitative comparison of our method with SOTA face-swapping methods on the FFHQ dataset. Best viewed in color and zoomed in.
  • ...and 9 more figures