E4S: Fine-grained Face Swapping via Editing With Regional GAN Inversion

Maomao Li, Ge Yuan, Cairong Wang, Zhian Liu, Yong Zhang, Yongwei Nie, Jue Wang, Dong Xu

TL;DR

This paper proposes a novel approach to face swapping from the perspective of fine-grained facial editing, dubbed "editing for swapping" (E4S), which outperforms existing methods in preserving texture, shape, and lighting.

Abstract

This paper proposes a novel approach to face swapping from the perspective of fine-grained facial editing, dubbed "editing for swapping" (E4S). Traditional face swapping methods rely on global feature extraction and fail to preserve the detailed source identity. In contrast, we propose a Regional GAN Inversion (RGI) method, which allows the explicit disentanglement of shape and texture. Specifically, our E4S performs face swapping in the latent space of a pretrained StyleGAN, where a multi-scale mask-guided encoder projects the texture of each facial component into regional style codes and a mask-guided injection module manipulates the feature maps with these style codes. Based on this disentanglement, face swapping can be simplified to style and mask swapping. Moreover, because the lighting conditions of the source and the target can differ greatly, transferring the source skin into the target image may produce disharmonious lighting. We propose a re-coloring network so that the swapped face maintains the target lighting condition while preserving the source skin. Further, to deal with potentially mismatched areas during mask exchange, we design a face inpainting module to refine the face shape. Extensive comparisons with state-of-the-art methods demonstrate that E4S outperforms existing methods in preserving texture, shape, and lighting. Our implementation is available at https://github.com/e4s2024/E4S2024.
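
The idea that swapping reduces to exchanging per-region style codes can be illustrated with a minimal sketch. It is not the authors' implementation: the region names, the choice of which regions come from the source, and the function `swap_style_codes` are all assumptions made for illustration, and the codes are plain lists rather than StyleGAN latents. The matching mask exchange is not shown.

```python
# Hypothetical set of regions whose texture is taken from the source face.
# The actual region list used by E4S is not stated in this summary.
SOURCE_REGIONS = frozenset({"skin", "brows", "eyes", "nose", "mouth"})


def swap_style_codes(source_codes, target_codes, source_regions=SOURCE_REGIONS):
    """Build per-region codes for the swapped face.

    source_codes / target_codes: dict mapping region name -> style code.
    Regions listed in source_regions take the source code when the source
    has one; every other region keeps the target code.
    """
    swapped = {}
    for region, target_code in target_codes.items():
        if region in source_regions and region in source_codes:
            swapped[region] = source_codes[region]
        else:
            swapped[region] = target_code
    return swapped


# Toy one-dimensional "style codes" for demonstration only.
source = {"skin": [1.0], "hair": [2.0], "eyes": [3.0]}
target = {"skin": [10.0], "hair": [20.0], "eyes": [30.0], "background": [40.0]}
swapped = swap_style_codes(source, target)
```

In this toy setup the swapped face takes skin and eyes from the source, while hair and background stay with the target.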

Paper Structure (29 sections, 23 equations, 14 figures, 3 tables)

This paper contains 29 sections, 23 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: E4S framework overview. (a) We first crop the face region of the source $S$ and the target $T$ to obtain $I_{\rm{s}}$ and $I_{\rm{t}}$. Then, a reenactment network $G_{\rm{r}}$ drives $I_{\rm{s}}$ to have a pose and expression similar to those of $I_{\rm{t}}$, obtaining the driven image $I_{\rm{d}}$. The segmentation masks of $I_{\rm{t}}$ and $I_{\rm{d}}$ are also estimated. (b) The driven and target pairs $(I_{\rm{d}}, M_{\rm{d}})$ and $(I_{\rm{t}}, M_{\rm{t}})$ are fed into the mask-guided encoder $F_{\phi}$ to extract per-region style codes that describe the texture, producing the texture codes $S_{\rm{d}}$ and $S_{\rm{t}}$, respectively. Next, we exchange the masks and the corresponding texture codes, obtaining $S_{\rm{swap}}$, which is then sent to the pre-trained StyleGAN generator $G_{\theta}$ with a mask-guided injection module to synthesize the naive swapped face $\tilde{I}$. (c) Finally, we propose a refinement stage, which includes a face re-coloring network $B_{\psi}$ for transferring the target lighting to $\tilde{I}$, and a face inpainting network $P_{\tau}$ for keeping the shape consistent with the source face.
  • Figure 2: Overview of the proposed RGI. The input face $I$ and its segmentation map $M$ are fed into a multi-scale encoder $F_{\phi}$ to extract the per-region texture vectors. The multi-scale texture vectors are then concatenated and passed through MLPs to obtain style codes residing in the latent space of StyleGAN. The regional style codes and the mask $M$ are used by our mask-guided StyleGAN generator to produce the reconstructed face $\tilde{I}$.
  • Figure 3: Comparison of the original StyleGAN and the proposed mask-guided StyleGAN. (a) The original StyleGAN consists of consecutive convolution blocks. Each block contains a modulation, a demodulation, and a convolution layer. $W$ and $b$ denote the learnable kernel weights for each block, and $s$ denotes the style code. B denotes the noise broadcast operation. An upsampling layer is used between every two blocks. (b) Our mask-guided StyleGAN extends the convolution block to operate per region. We sum up the intermediate feature maps of each region using its segmentation mask, which is downsized in advance.
  • Figure 4: Illustration of our re-coloring network $B_{\psi}$. During training, we first obtain the randomly re-colored $I'_{\rm{AG}}$ and the flipped $I'_{\rm{A}}$ and extract their features $f_{\rm{AG}}$ and $f_{\rm{A}}$ with an FPN (zhang2020cross). For each region $r$, we calculate an attention map between $f_{\rm{AG}}$ and $f_{\rm{A}}$, which is then multiplied with the downsampled $I'_{\rm{A}}$ and composed to generate the color guidance $\Pi$. Finally, a U-Net takes as input the concatenation of $\Pi$, the input images $I'_{\rm{AG}}$ and $I'_{\rm{A}}$, and their segmentation masks, generating the re-colored result $\tilde{I}^*_{\rm{rec}}$, which is used to calculate the reconstruction loss against the flipped $I'_{\rm{A}}$. During testing, we adopt the same scheme to transfer the color from the target $I_{\rm{t}}$ to the naive swapped face $\tilde{I}$, resulting in the re-colored swapped result $\tilde{I}^*_{\rm{rec}}$.
  • Figure 5: Illustration of the inpainting network $P_{\tau}$, which adaptively inpaints the mismatched regions at the pixel level. Given a driven face $I_{\rm{d}}$ and the pasted $\tilde{I}_{\rm{rec}}$, we first calculate the mismatch-region mask $M_{\rm{r}}$. Then the image $\tilde{I}_{\rm{rec}}$ and the mismatch mask $M_{\rm{r}}$ are fed to an auto-encoder to inpaint the mismatched pixels. The generation process of the decoder is modulated by the scale factors $\alpha_1$ and $\alpha_2$, which are extracted from the area ratio $s$ of the mismatch regions. The final inpainting result is denoted as $\tilde{I}_{\rm{inp}}$. For simplicity, the normalization layers are omitted in this figure.
  • ...and 9 more figures
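
The mask-guided summation mentioned in the Figure 3 caption (per-region feature maps summed under a segmentation mask that is downsized in advance) can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's code: feature maps are single-channel nested lists, the mask is a map of integer region labels, downsizing is done by nearest-neighbour striding, and the names `downsize_labels` and `compose_by_mask` are hypothetical.

```python
def downsize_labels(label_map, factor):
    """Nearest-neighbour downsizing: keep every `factor`-th row and column.

    Assumption: the summary does not say how the mask is resized.
    """
    return [row[::factor] for row in label_map[::factor]]


def compose_by_mask(region_features, label_map):
    """Sum per-region feature maps, each weighted by its binary region mask.

    region_features: dict mapping region id -> H x W nested list of floats.
    label_map: H x W nested list of region ids, already at feature resolution.
    With non-overlapping regions this selects, at each pixel, the feature of
    the region that owns the pixel.
    """
    height, width = len(label_map), len(label_map[0])
    out = [[0.0] * width for _ in range(height)]
    for region, feat in region_features.items():
        for y in range(height):
            for x in range(width):
                mask = 1.0 if label_map[y][x] == region else 0.0
                out[y][x] += mask * feat[y][x]
    return out


# A 4 x 4 label map: region 0 on the left half, region 1 on the right half.
full_res_labels = [[0, 0, 1, 1] for _ in range(4)]
small_labels = downsize_labels(full_res_labels, 2)

# One constant 2 x 2 feature map per region, standing in for the output of
# a convolution block modulated by that region's style code.
features = {
    0: [[1.0, 1.0], [1.0, 1.0]],
    1: [[5.0, 5.0], [5.0, 5.0]],
}
composed = compose_by_mask(features, small_labels)
```

Here `composed` takes its left column from region 0 and its right column from region 1, which is the sense in which each region's texture is injected only where its mask is active.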