FACEMUG: A Multimodal Generative and Fusion Framework for Local Facial Editing

Wanglong Lu, Jikai Wang, Xiaogang Jin, Xianta Jiang, Hanli Zhao

TL;DR

FACEMUG introduces a globally consistent local facial editing framework that unifies multiple modalities (sketches, semantic maps, colors, exemplars, text, and attribute labels) into StyleGAN's latent space. It deploys a novel multimodal aggregation and style fusion mechanism, a self-supervised latent warping module for pose alignment, and a latent- and feature-space fusion strategy via a refinement auto-encoder, yielding high-fidelity, region-specific edits with unedited regions preserved. The approach demonstrates strong quantitative and qualitative gains over SOTA multimodal and sketch/semantic-guided editors across CelebA-HQ and FFHQ, with fast inference. By eliminating the need for manual paired data and enabling incremental edits, FACEMUG offers a practical, scalable solution for diverse local facial editing tasks with fine-grained semantic control.

Abstract

Existing facial editing methods have achieved remarkable results, yet they often fall short in supporting multimodal conditional local facial editing. One significant piece of evidence is that their output image quality degrades dramatically after several iterations of incremental editing, as they do not support local editing. In this paper, we present a novel multimodal generative and fusion framework for globally-consistent local facial editing (FACEMUG) that can handle a wide range of input modalities and enable fine-grained and semantic manipulation while leaving unedited parts unchanged. Different modalities, including sketches, semantic maps, color maps, exemplar images, text, and attribute labels, are adept at conveying diverse conditioning details, and their combined synergy can provide more explicit guidance for the editing process. We thus integrate all modalities into a unified generative latent space to enable multimodal local facial edits. Specifically, a novel multimodal feature fusion mechanism is proposed that uses multimodal aggregation and style fusion blocks to fuse facial priors and multimodal inputs in both latent and feature spaces. We further introduce a novel self-supervised latent warping algorithm to rectify misaligned facial features, efficiently transferring the pose of the edited image to the given latent codes. We evaluate FACEMUG through extensive experiments and comparisons to state-of-the-art (SOTA) methods. The results demonstrate the superiority of FACEMUG in terms of editing quality, flexibility, and semantic control, making it a promising solution for a wide range of local facial editing tasks.

Paper Structure

This paper contains 42 sections, 24 equations, 25 figures, 7 tables, 3 algorithms.

Figures (25)

  • Figure 1: Examples demonstrating the superior performance of FACEMUG in high-quality globally consistent local facial editing, using subsets of the five modalities: semantic label, sketch, text, color, and exemplar image. Our method (top row) exhibits better visual quality and fidelity in incremental editing (each later edit takes the previous output image as input), compared to SOTA multimodal face editing methods: ColDiffusion huang2023collaborative (middle row) and Unite&Conquer nair2023unite (bottom row).
  • Figure 2: Overall pipeline of FACEMUG for globally consistent local facial editing: the given attribute label (or text), random latent code $z$, and exemplar image $\mathbf{I}_{ex}$ are first processed through the exemplar style module, the latent warping module, and the latent attribute editing module to get the edited latent codes. Simultaneously, the pixel-wise multimodal inputs $\mathcal{X}$ and a binary mask $\mathbf{M}$ are fed into the multimodal aggregation module and the multimodal generator to get an edited realistic face image $\mathbf{I}_{out}$, where the manipulation of the masked regions in $\mathbf{M}$ is guided by the multimodal inputs.
  • Figure 3: Illustration of our style fusion block. Conditioned by the modulated latent vector ${w}^*_{i+1}$, the block effectively integrates multi-scale facial features in both high-level and shallow-level feature spaces.
  • Figure 4: Illustration of the self-supervised training of our latent warping module. We employ the style encoder to project the augmented image to obtain the target latent codes $w^{ta}$. The source latent codes $w^{so}$ are sampled using interpolation between the initial latent codes $w^{ini}$ and the flipped latent codes $w^{f}$. The identity loss, the LPIPS loss, and the attribute loss are utilized as constraints to disentangle the identity and pose. This module effectively transfers the pose of $w^{ta}$ to the warped latent codes $w^{wa}$ while keeping other facial features unchanged. The inversion process is used for visualization purposes only.
  • Figure 5: Visual comparison to ColDiffusion huang2023collaborative and Unite&Conquer nair2023unite for text-driven multimodal facial editing. Our method produces visually appealing and globally consistent images that respond well to the corresponding multimodal inputs, and leaves unmasked parts unchanged.
  • ...and 20 more figures
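
The Figure 4 caption says the source latent codes $w^{so}$ are sampled by interpolating between the initial latent codes $w^{ini}$ and the flipped latent codes $w^{f}$. A minimal sketch of that sampling step follows. It is not the authors' implementation: the linear form of the interpolation, the uniform distribution of the weight `alpha`, the `(18, 512)` latent shape, and the function name `sample_source_latents` are all assumptions made for illustration, and random arrays stand in for real latent codes.

```python
import numpy as np


def sample_source_latents(w_ini, w_f, rng, alpha=None):
    """Interpolate between initial and flipped latent codes.

    Assumption: plain linear interpolation with one scalar weight shared
    by all layers. The caption only says "interpolation".
    """
    if w_ini.shape != w_f.shape:
        raise ValueError("w_ini and w_f must have the same shape")
    if alpha is None:
        # Assumption: the weight is drawn uniformly from [0, 1).
        alpha = float(rng.uniform(0.0, 1.0))
    w_so = (1.0 - alpha) * w_ini + alpha * w_f
    return w_so, alpha


rng = np.random.default_rng(0)
# Hypothetical latent shape (18 layers x 512 dims); random stand-ins
# replace the codes a real style encoder would produce.
w_ini = rng.standard_normal((18, 512))
w_f = rng.standard_normal((18, 512))

w_so, alpha = sample_source_latents(w_ini, w_f, rng)
print(w_so.shape, 0.0 <= alpha <= 1.0)
```

With `alpha=0.0` the result equals `w_ini`, and with `alpha=1.0` it equals `w_f`, so the sampled codes lie between the original and mirrored versions of the same face.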