MuseFace: Text-driven Face Editing via Diffusion-based Mask Generation Approach

Xin Zhang; Siting Huang; Xiangyang Luo; Yifan Xie; Weijiang Yu; Heng Chang; Fei Ma; Fei Yu

TL;DR

MuseFace tackles the challenge of controllable, diverse, and flexible text-driven face editing by recasting edits from pixel space to semantic space and introducing a two-stage diffusion framework. Stage 1 (Text-to-Mask) generates fine-grained semantic masks from text prompts, leveraging amodal segmentation data and a Mask-aware Autoencoder to preserve unchanged regions. Stage 2 (Semantic-aware Face Editing) applies a multimodal diffusion-based editing model conditioned on the generated masks to edit the image guided by a caption, with training updating only the conditional network. Empirically, MuseFace delivers superior fidelity, realism, and controllability across quantitative metrics and qualitative studies, including zero-shot and cross-ID editing scenarios, demonstrating strong practical impact for high-quality, text-controlled face editing.

Abstract

Face editing modifies the appearance of a face and plays a key role in the customization and enhancement of personal images. Although many works have achieved remarkable success in text-driven face editing, none of them simultaneously offers diversity, controllability, and flexibility. To address this challenge, we propose MuseFace, a text-driven face editing framework that relies solely on text prompts. Specifically, MuseFace integrates a Text-to-Mask diffusion model and a semantic-aware face editing model, so that it can generate fine-grained semantic masks directly from text and then perform face editing. The Text-to-Mask diffusion model provides diversity and flexibility, while the semantic-aware face editing model ensures controllability. Because the generated semantic masks are fine-grained, the framework makes precise face editing possible and significantly enhances the controllability and flexibility of face editing models. Extensive experiments demonstrate that MuseFace achieves superior high-fidelity performance.
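The two-stage data flow described above (text to semantic mask, then mask-conditioned editing) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `run_two_stage_edit`, the type aliases, and both toy stand-in models are hypothetical, and passing the input image to the first stage is an assumption, since the text only states that masks are generated from text prompts. The diffusion models themselves are replaced by trivial placeholder functions so that the example runs on its own.

```python
from typing import Callable, List, Tuple

# Toy representations (assumptions for illustration only).
Image = List[List[int]]  # grayscale pixel values
Mask = List[List[int]]   # one semantic class id per pixel

# Hypothetical interfaces for the two stages.
TextToMask = Callable[[Image, str], Mask]
EditWithMask = Callable[[Image, Mask, str], Image]


def run_two_stage_edit(
    image: Image,
    mask_prompt: str,
    caption: str,
    text_to_mask: TextToMask,
    edit_with_mask: EditWithMask,
) -> Tuple[Mask, Image]:
    """Chain a text-to-mask stage and a mask-conditioned editing stage."""
    # Stage 1: produce a semantic mask from the text prompt.
    mask = text_to_mask(image, mask_prompt)

    # The mask must align pixel-for-pixel with the image it conditions.
    same_rows = len(mask) == len(image)
    if not same_rows or any(len(m) != len(i) for m, i in zip(mask, image)):
        raise ValueError("mask and image must have the same spatial size")

    # Stage 2: edit the image, conditioned on the mask and guided by a caption.
    edited = edit_with_mask(image, mask, caption)
    return mask, edited


def toy_text_to_mask(image: Image, prompt: str) -> Mask:
    """Placeholder for stage 1: labels the top row as class 1, ignoring the prompt."""
    return [[1 if r == 0 else 0 for _ in row] for r, row in enumerate(image)]


def toy_edit_with_mask(image: Image, mask: Mask, caption: str) -> Image:
    """Placeholder for stage 2: repaints class-1 pixels, leaves the rest unchanged."""
    return [
        [255 if m == 1 else p for p, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


if __name__ == "__main__":
    image = [[10, 20], [30, 40]]
    mask, edited = run_two_stage_edit(
        image, "longer hair", "a person with long hair",
        toy_text_to_mask, toy_edit_with_mask,
    )
    print(mask)    # [[1, 1], [0, 0]]
    print(edited)  # [[255, 255], [30, 40]]
```

In the toy run, only the pixels that the mask assigns to the edited class change, which mirrors the role the abstract gives the semantic mask as the source of controllability. Real stages would be learned models rather than the placeholders used here.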
