MuseFace: Text-driven Face Editing via Diffusion-based Mask Generation Approach

Xin Zhang, Siting Huang, Xiangyang Luo, Yifan Xie, Weijiang Yu, Heng Chang, Fei Ma, Fei Yu

TL;DR

MuseFace tackles the challenge of controllable, diverse, and flexible text-driven face editing by recasting edits from pixel space to semantic space and introducing a two-stage diffusion framework. Stage 1 (Text-to-Mask) generates fine-grained semantic masks from text prompts, leveraging amodal segmentation data and a Mask-aware Autoencoder to preserve unchanged regions. Stage 2 (Semantic-aware Face Editing) applies a multimodal diffusion-based editing model conditioned on the generated masks to edit the image guided by a caption, with training updating only the conditional network. Empirically, MuseFace delivers superior fidelity, realism, and controllability across quantitative metrics and qualitative studies, including zero-shot and cross-ID editing scenarios, demonstrating strong practical impact for high-quality, text-controlled face editing.
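
The two-stage data flow summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: both diffusion models are replaced by hypothetical callables (`text_to_mask`, `face_editor`), the semantic map is assumed to be a grid of integer class labels, and a simple label-based merge (`merge_preserving_unchanged`) stands in for the Mask-aware Autoencoder's role of preserving unchanged regions. All names and the label encoding are assumptions made for illustration.

```python
from typing import Callable, List, Set, Tuple

# Assumed encoding: one integer class label per pixel.
SemanticMap = List[List[int]]


def merge_preserving_unchanged(original: SemanticMap,
                               proposed: SemanticMap,
                               editable: Set[int]) -> SemanticMap:
    """Accept a proposed label only where the edit involves an editable class.

    A pixel may change if its original or its proposed label is in
    `editable`; every other pixel keeps its original label. This is a
    simple stand-in, not the paper's Mask-aware Autoencoder.
    """
    merged = []
    for row_o, row_p in zip(original, proposed):
        merged.append([p if (o in editable or p in editable) else o
                       for o, p in zip(row_o, row_p)])
    return merged


def edit_face(image: object,
              semantic_map: SemanticMap,
              edit_text: str,
              caption: str,
              editable: Set[int],
              text_to_mask: Callable[[SemanticMap, str], SemanticMap],
              face_editor: Callable[[object, SemanticMap, str], object],
              ) -> Tuple[object, SemanticMap]:
    """Run Stage 1 (text -> edited mask), then Stage 2 (mask + caption -> image)."""
    proposed = text_to_mask(semantic_map, edit_text)               # Stage 1
    edited_map = merge_preserving_unchanged(semantic_map, proposed, editable)
    edited_image = face_editor(image, edited_map, caption)         # Stage 2
    return edited_image, edited_map


# Toy usage with hypothetical labels and stub models.
BG, HAIR, SKIN = 0, 1, 2
original = [[BG, HAIR, BG],
            [BG, SKIN, BG]]


def toy_text_to_mask(sem_map: SemanticMap, text: str) -> SemanticMap:
    # Stub: extends the hair region but also (wrongly) erases the skin pixel.
    return [[HAIR, HAIR, HAIR],
            [BG, BG, BG]]


def toy_face_editor(image: object, sem_map: SemanticMap, caption: str) -> dict:
    # Stub: records its conditioning instead of generating pixels.
    return {"image": image, "mask": sem_map, "caption": caption}


result, edited_map = edit_face("reference.png", original, "longer hair",
                               "a man with long hair", {HAIR},
                               toy_text_to_mask, toy_face_editor)
print(edited_map)  # [[1, 1, 1], [0, 2, 0]]
```

In this toy run the hair region grows as requested, while the skin pixel that the stub tried to erase keeps its original label, which mirrors the idea of editing in semantic space and leaving untouched regions alone.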

Abstract

Face editing modifies the appearance of a face and plays a key role in the customization and enhancement of personal images. Although many works have achieved remarkable success in text-driven face editing, they still face significant challenges, as none of them simultaneously fulfills the characteristics of diversity, controllability, and flexibility. To address this challenge, we propose MuseFace, a text-driven face editing framework that relies solely on a text prompt to enable face editing. Specifically, MuseFace integrates a Text-to-Mask diffusion model and a semantic-aware face editing model, and is capable of directly generating fine-grained semantic masks from text and performing face editing. The Text-to-Mask diffusion model provides diversity and flexibility to the framework, while the semantic-aware face editing model ensures its controllability. Our framework can create fine-grained semantic masks, making precise face editing possible and significantly enhancing the controllability and flexibility of face editing models. Extensive experiments demonstrate that MuseFace achieves superior high-fidelity performance.

Paper Structure

This paper contains 13 sections, 9 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The overall pipeline of our proposed MuseFace, which consists of two diffusion models: the Text-to-Mask diffusion model and the semantic-aware face editing model. The only inputs to MuseFace are the reference images $\mathcal{I}$, the text $\mathcal{T}_{edit}$ specifying the part to be edited, and the caption $\mathcal{T}_{caption}$ of $\hat{\mathcal{I}}$. The Text-to-Mask diffusion model edits the semantic map driven by $\mathcal{T}_{edit}$, and its output is used by the face editing model to generate the edited face image. The dashed box in the lower left corner is a schematic of the training data accessibility.
  • Figure 2: Qualitative comparison of face editing by different methods. MuseFace is capable of handling a wide range of input scenarios with exceptional controllability and flexibility. For a fair comparison, we recommend following the procedure for textual instructions, for instance by giving detailed instructions such as "This man has a medium-length hair." instead of a single word such as "Hair". Please zoom in for better visualization and refer to the supplementary material for more results.
  • Figure 3: Diverse results of cross-ID generation. Multimodal face generation is performed using the output from MuseFace, maintaining good consistency while preserving diversity.
  • Figure 4: Edited images in the wild. Users can input an image and text and interact with MuseFace to conveniently generate an edited version of the image.