Improving face generation quality and prompt following with synthetic captions

Michail Tarasiou, Stylianos Moschoglou, Jiankang Deng, Stefanos Zafeiriou

TL;DR

The paper addresses the problem that diffusion models struggle to follow prompts faithfully when generating human faces, because standard training captions lack appearance details. It introduces a training-free pipeline that extracts rich facial appearance attributes, converts them into synthetic captions via a bag-of-words-to-caption process using Vicuna 13B, and uses these captions to fine-tune Stable Diffusion 2.1 with LoRA. Experiments on EasyPortrait, FFHQ, and LAION-Face show that the fine-tuned model yields more realistic faces and better prompt adherence than the base model, while preserving identity under changes in age, gender, and ethnicity. The approach reduces dependence on prompt engineering, and the authors release the synthetic captions, pretrained checkpoints, and code to support further research in realistic human-face generation.
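
As a rough illustration of the bag-of-words-to-caption step, the sketch below flattens a set of extracted attributes into a word list and wraps it in an instruction prompt that could be passed to an LLM such as Vicuna 13B. The attribute names, the helper functions, and the instruction wording are hypothetical and are not taken from the paper; the LLM call itself is omitted.

```python
from __future__ import annotations

# Hypothetical sketch: attribute names, helper names, and the instruction
# wording are assumptions for illustration, not the paper's actual pipeline.


def attributes_to_bag_of_words(attributes: dict[str, str]) -> list[str]:
    """Flatten attributes into 'key: value' strings, skipping empty values."""
    return [f"{key}: {value}" for key, value in attributes.items() if value]


def build_caption_prompt(attributes: dict[str, str]) -> str:
    """Build an instruction prompt asking an LLM to write a fluent caption."""
    words = attributes_to_bag_of_words(attributes)
    return (
        "Write one fluent sentence describing a person's appearance "
        "using only the following attributes.\n"
        "Attributes: " + "; ".join(words) + "\n"
        "Caption:"
    )


example = {
    "age": "40",
    "gender": "male",
    "hair color": "black",
    "expression": "happy",
    "glasses": "",  # empty value: attribute not detected, so it is skipped
}
prompt = build_caption_prompt(example)
print(prompt)
```

Keeping the attribute list separate from the instruction text means the same word list can be reused with a different LLM or a different instruction without touching the attribute extraction stage.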

Abstract

Recent advancements in text-to-image generation using diffusion models have significantly improved the quality of generated images and expanded the ability to depict a wide range of objects. However, ensuring that these models adhere closely to the text prompts remains a considerable challenge. This issue is particularly pronounced when trying to generate photorealistic images of humans. Without significant prompt engineering effort, models often produce unrealistic images and typically fail to incorporate the full extent of the prompt information. This limitation can be largely attributed to the nature of the captions accompanying the images used to train large-scale diffusion models, which typically prioritize contextual information over details related to the person's appearance. In this paper we address this issue by introducing a training-free pipeline designed to generate accurate appearance descriptions from images of people. We apply this method to create approximately 250,000 captions for publicly available face datasets. We then use these synthetic captions to fine-tune a text-to-image diffusion model. Our results demonstrate that this approach significantly improves the model's ability to generate high-quality, realistic human faces and enhances adherence to the given prompts, compared to the baseline model. We share our synthetic captions, pretrained checkpoints and training code.

Paper Structure (10 sections, 4 figures)

This paper contains 10 sections, 4 figures.

Figures (4)

  • Figure 1: Comparison with captions from the LAION-Face dataset (FaRL). For each image we show the caption provided in LAION-Face (top) and the output of our synthetic captioning pipeline (bottom).
  • Figure 2: Comparison with SD2.1 base model. For each prompt we show the image generated by the SD2.1 base model (left) as well as our finetuned LoRA model (right).
  • Figure 3: Generated images varying the person's age. Images are generated through the prompt "A {age} year old white male with black hair and happy expression." where {age} is substituted by the number shown in each image. We observe that identity characteristics are decoupled from age.
  • Figure 4: Generated images varying a person's ethnicity, emotion and gender. Images are generated through the prompt "A 40 year old {ethnicity} {gender} with black hair and {expression} expression." where ethnicity, gender and expression are modified accordingly. Gender varies between the left and right panels. Rows correspond to Black, White and Asian ethnicity. Columns within each panel show generations for happy, neutral and sad expressions.
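
The prompt templates in Figures 3 and 4 can be expanded programmatically. The following is a minimal sketch, assuming a space between the ethnicity and gender slots; the age values, the "female" gender string, and the capitalization of the ethnicity strings are placeholders, since the exact values used for the figures are not listed here. Image generation itself is left out.

```python
from itertools import product

# Templates follow the prompts quoted in the Figure 3 and Figure 4 captions.
# The space between {ethnicity} and {gender} is an assumption.
AGE_TEMPLATE = "A {age} year old white male with black hair and happy expression."
GRID_TEMPLATE = (
    "A 40 year old {ethnicity} {gender} with black hair and {expression} expression."
)


def age_prompts(ages):
    """One prompt per age, as in the Figure 3 sweep."""
    return [AGE_TEMPLATE.format(age=age) for age in ages]


def grid_prompts(ethnicities, genders, expressions):
    """One prompt per (gender, ethnicity, expression) cell, as in Figure 4."""
    return [
        GRID_TEMPLATE.format(ethnicity=e, gender=g, expression=x)
        for g, e, x in product(genders, ethnicities, expressions)
    ]


ages = [20, 40, 60]  # placeholder values, not the ages used in the paper
figure3 = age_prompts(ages)
figure4 = grid_prompts(
    ethnicities=["Black", "White", "Asian"],  # rows
    genders=["male", "female"],               # left and right panels (assumed strings)
    expressions=["happy", "neutral", "sad"],  # columns
)

for p in figure3 + figure4:
    print(p)
```

Each resulting string would then be passed to the base and fine-tuned models with a fixed seed if a like-for-like comparison is wanted.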