Face-MakeUp: Multimodal Facial Prompts for Text-to-Image Generation

Dawei Dai; Mingming Jia; Yinxiu Zhou; Hang Xing; Chenghang Li

TL;DR

This paper tackles the difficulty of generating precise facial images with text prompts by introducing Face-MakeUp, a diffusion-based framework that leverages image prompts and facial-specific representations. It constructs FaceCaptionHQ-4M, a large high-quality facial image-text dataset, and integrates multi-scale facial features, ArcFace identity embeddings, and pose information through a Face-ID Cross-Attn fusion mechanism and PoseNet into Stable Diffusion. The approach yields superior realism, facial identity fidelity, and attribute richness on two test sets, with ablations confirming the importance of the dataset and pose information; manual evaluations corroborate improvements in realism and identity preservation. The work also demonstrates useful capabilities such as identity mixing and stylization, and it will release data, model checkpoints, and code to support reproducibility and further research.
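To make the idea of fusing facial prompts with text prompts concrete, here is a minimal, dependency-free sketch of one common pattern: the latent tokens attend to text tokens and to facial tokens separately, and the two results are summed. This is a generic decoupled cross-attention and is only an assumption about how such fusion might look; it is not the paper's Face-ID Cross-Attn or PoseNet, and the names `fuse_prompts` and `face_scale` are hypothetical. Learned query/key/value projections are omitted for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product attention.

    Context tokens serve as both keys and values; a real model would
    apply learned projections first (omitted in this sketch).
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, context)) for i in range(dim)]
        )
    return outputs


def fuse_prompts(latent_tokens, text_tokens, face_tokens, face_scale=1.0):
    """Hypothetical fusion: attend to text and face tokens separately, then add.

    face_scale weights the facial branch; 0.0 gives text-only conditioning.
    """
    text_out = cross_attention(latent_tokens, text_tokens)
    face_out = cross_attention(latent_tokens, face_tokens)
    return [
        [t + face_scale * f for t, f in zip(t_row, f_row)]
        for t_row, f_row in zip(text_out, face_out)
    ]
```

In this sketch, `face_tokens` stands in for whatever facial representation is available (for example identity embeddings or multi-scale image features projected to the same width as the text tokens). Lowering `face_scale` weakens the influence of the reference face relative to the text prompt.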

Abstract

Facial images have extensive practical applications. Although current large-scale text-to-image diffusion models exhibit strong generation capabilities, it is challenging to generate the desired facial images using only text prompts. Image prompts are a logical choice. However, current methods of this type generally focus on the general domain. In this paper, we aim to optimize image makeup techniques to generate the desired facial images. Specifically, (1) we built a dataset of 4 million high-quality face image-text pairs (FaceCaptionHQ-4M) based on LAION-Face to train our Face-MakeUp model; (2) to maintain consistency with the reference facial image, we extract/learn multi-scale content features and pose features from the facial image and integrate them into the diffusion model to better preserve facial identity. Validation on two face-related test datasets demonstrates that Face-MakeUp achieves the best overall performance. All code is available at: https://github.com/ddw2AIGROUP2CQUPT/Face-MakeUp

Paper Structure (14 sections, 5 figures, 4 tables)

Introduction
Method
Constructing FaceCaptionHQ-4M
Architecture of Face-MakeUp
Extracting Representations for Facial Image-Text
Learning Representations of Facial Pose
Fusion Strategy
Experiments
Implementation Details
Comparisons with Existing Methods
Manual evaluation
Ablation Study
Other Applications
Conclusion

Figures (5)

Figure 1: Illustrations of our Face-MakeUp. The first column shows the image prompts (references), and the other columns show images generated by our model from those references.
Figure 2: Overview of our Face-MakeUp.
Figure 3: Manual evaluation.
Figure 4: Identity mixing. Face-MakeUp can generate an image with a new ID while preserving the characteristics of the input identities.
Figure 5: Stylization. Face-MakeUp can generate facial images in different styles using image and text prompts.
