SelfSwapper: Self-Supervised Face Swapping via Shape Agnostic Masked AutoEncoder

Jaeseong Lee; Junha Hyung; Sohyun Jeong; Jaegul Choo

TL;DR

SelfSwapper introduces SAMAE, a self-supervised Shape-Agnostic Masked AutoEncoder for face swapping that avoids target identity leakage and improves cross-identity realism. By disentangling identity from non-identity attributes and leveraging 3DMM-based geometry, a foreground mask, and learnable skin/albedo representations, SAMAE enables robust cross-identity swaps; it further mitigates shape misalignment and volume discrepancies via perforation confusion and random mesh scaling. Empirically, SAMAE achieves state-of-the-art performance on standard benchmarks, with strong qualitative results and ablations validating the effectiveness of the proposed techniques. The approach offers a robust, generalizable framework for realistic, privacy-conscious face swapping with reduced leakage and well-preserved target illumination and geometry, while acknowledging ethical considerations and potential future enhancements.

Abstract

Face swapping has gained significant attention for its varied applications. Most previous face swapping approaches have relied on the seesaw game training scheme, also known as the target-oriented approach. However, this often leads to unstable training and to undesired samples with blended identities, caused by the target identity leakage problem. Source-oriented methods achieve more stable training with a self-reconstruction objective but often fail to accurately reflect the target image's skin color and illumination. This paper introduces the Shape Agnostic Masked AutoEncoder (SAMAE) training scheme, a novel self-supervised approach that combines the strengths of both target-oriented and source-oriented approaches. Our training scheme addresses the limitations of traditional training methods by circumventing the conventional seesaw game and introducing a clear ground truth through its self-reconstruction training regime. Our model effectively mitigates identity leakage and reflects target albedo and illumination through learned, disentangled identity and non-identity features. Additionally, we tackle the shape misalignment and volume discrepancy problems with new techniques, including perforation confusion and random mesh scaling. SAMAE establishes a new state of the art, surpassing the baseline methods and preserving both identity and non-identity attributes without sacrificing either.
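The abstract names random mesh scaling without giving details, so the following is only a minimal sketch of what such an augmentation could look like, not the authors' implementation. It assumes the face mesh is a list of 3D vertices and that scaling is uniform about the mesh centroid; the function name `random_mesh_scale` and the range 0.9 to 1.1 are hypothetical choices made for illustration.

```python
import random
from typing import List, Optional, Tuple

Vertex = Tuple[float, float, float]


def random_mesh_scale(
    vertices: List[Vertex],
    low: float = 0.9,   # hypothetical lower bound, not taken from the paper
    high: float = 1.1,  # hypothetical upper bound, not taken from the paper
    rng: Optional[random.Random] = None,
) -> Tuple[List[Vertex], float]:
    """Scale a mesh about its centroid by one random uniform factor.

    Returns the scaled vertices and the sampled factor.
    """
    if not vertices:
        raise ValueError("vertices must not be empty")
    rng = rng or random.Random()
    factor = rng.uniform(low, high)

    # Centroid of the mesh; scaling about it keeps the mesh in place.
    n = len(vertices)
    cx = sum(v[0] for v in vertices) / n
    cy = sum(v[1] for v in vertices) / n
    cz = sum(v[2] for v in vertices) / n

    scaled = [
        (cx + factor * (x - cx), cy + factor * (y - cy), cz + factor * (z - cz))
        for x, y, z in vertices
    ]
    return scaled, factor


if __name__ == "__main__":
    mesh = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 2.0)]
    scaled_mesh, s = random_mesh_scale(mesh, rng=random.Random(0))
    print(f"sampled scale factor: {s:.3f}")
```

Under these assumptions, a mask rendered from the rescaled mesh would no longer match the true face size exactly, which is one plausible way a self-reconstruction model could be pushed to tolerate the volume discrepancies the abstract describes.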

Paper Structure (14 sections, 2 equations, 7 figures, 2 tables)

This paper contains 14 sections, 2 equations, 7 figures, 2 tables.

Introduction
Related Work
Backgrounds
Method
Shape Agnostic Masked AutoEncoder
Perforation Confusion
Random Mesh Scaling
Disentangling Albedo Condition
Training Objectives
Experiments
Qualitative Comparisons
Quantitative Comparisons
Ablation Study
Discussion

Figures (7)

Figure 1: Real-world application of our model. This figure showcases results on in-the-wild samples: navy circles mark the source, pink circles mark the target, and the overlaps show the generated outputs. The second row displays one-source, multi-target results. Our model accurately transforms the target face to match the source while faithfully preserving target attributes such as skin color, pose, expression, hair, background, and gaze. This demonstrates our model's robustness on in-the-wild samples and its real-world applicability to diverse facial images. For resolutions beyond $256\times256$, the off-the-shelf super-resolution model RestoreFormer is used.
Figure 2: (A) Conceptual comparison between prior works and our method. Prior works rely on a seesaw game between two potentially conflicting losses: a reconstruction loss and an identity loss. In contrast, our method leverages a self-supervised approach with a clear ground truth, which allows for more stable training. (B) Comparison of our base approach (Ours Base) with our enhanced method (Ours Full), which includes techniques such as perforation confusion and random mesh scaling. Green masks represent target-posed source 3DMM masks, red masks indicate target 3DMM masks, and orange masks denote their intersection. The first row shows that when the source face is larger than the target's, the base model cuts off the jaw. The second row shows the opposite case, where the base model fails to inpaint the remaining regions effectively, while Ours Full generates realistic face-swapped outputs.
Figure 3: Overall pipeline of our method. In the training phase (left), we employ a self-reconstruction scheme with perforation confusion and random mesh scaling, which makes SAMAE training robust to shape differences. During inference (right), this training enables the model to efficiently perform cross-identity face swapping by disentangling ID and non-ID attributes.
Figure 4: Comparison with target-oriented baselines. (Top) Other baselines struggle to replicate the source's facial contours and volumes (e.g., jaw shape and facial scale) and (Bottom) its inner facial traits (e.g., pupil color, beard, and cheekbones). In contrast, Ours conveys these with high fidelity. The red and orange indicators mark the regions to compare in detail.
Figure 5: Comparison with source-oriented baselines, and a two-dimensional plot comparing identity scores and FID.
...and 2 more figures
