
FaceScore: Benchmarking and Enhancing Face Quality in Human Generation

Zhenyi Liao, Qingsong Xie, Chen Chen, Hannan Lu, Zhijie Deng

TL;DR

FaceScore (FS) is a novel face-quality metric, developed by fine-tuning the widely used ImageReward on a dataset of (win, loss) face pairs cheaply crafted with a diffusion-model inpainting pipeline; it aligns with human judgments better than existing metrics.

Abstract

Diffusion models (DMs) have achieved significant success in generating imaginative images from textual descriptions. However, they often fall short in real-life scenarios with intricate details. Low-quality, unrealistic human faces are among the most prominent issues in text-to-image generation, hindering the wide application of DMs in practice. To address this issue, we first assess the face quality of generations from popular pre-trained DMs with the aid of human annotators and then evaluate the alignment of existing metrics with human judgments. Observing that existing metrics can be unsatisfactory for quantifying face quality, we develop a novel metric named FaceScore (FS) by fine-tuning the widely used ImageReward on a dataset of (win, loss) face pairs cheaply crafted by an inpainting pipeline of DMs. Extensive studies reveal that FS enjoys a superior alignment with humans. Moreover, FS opens the door to enhancing DMs for better face generation: with FS offering image ratings, we can easily apply preference learning algorithms to refine DMs such as SDXL. Comprehensive experiments verify the efficacy of our approach for improving face quality. The code is released at https://github.com/OPPO-Mente-Lab/FaceScore.
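Fine-tuning a scorer like ImageReward on (win, loss) pairs is typically done with a Bradley-Terry-style pairwise ranking loss, which pushes the winner's score above the loser's. The sketch below is illustrative only (the paper does not specify its exact loss) and shows the standard form on scalar scores:

```python
import math

def pairwise_ranking_loss(score_win: float, score_loss: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(s_win - s_loss)).

    The loss shrinks as the winning face is scored further above the
    losing face, and equals log(2) when the two scores tie.
    """
    margin = score_win - score_loss
    # Numerically stable form of -log(sigmoid(margin)) = log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))
```

In practice the scores would come from the reward model applied to the two faces of a pair, and the loss would be averaged over a batch of pairs.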


Paper Structure

This paper contains 27 sections, 6 equations, 13 figures, and 7 tables.

Figures (13)

  • Figure 1: Bad faces generated by Realistic Vision V5.1 (left) and SDXL (right), with the prompts shown below. Faces, especially small-scale ones, are highly likely to be vague and distorted. The face region is enlarged and placed in the bottom-left corner of each image. Zoom in for face details.
  • Figure 2: Comparison between generations sampled without (left) and with (right) negative prompts from Realistic Vision V5.1. All conditions are identical except for the negative prompt, set as "bad face, deformed, poorly drawn face, mutated, ugly, bad anatomy". The face region improves with negative prompts, yet the generation still suffers from low quality. Zoom in for more face details.
  • Figure 3: An example of a human-annotated triplet. The image with higher face quality is assigned a higher score. Each triplet yields three binary comparisons.
  • Figure 4: An example of a face pair. We use the inpainting pipeline and control the noise strength for a degraded version, thereby forming a (win, loss) face pair.
  • Figure 5: Overview of our pipeline. We leverage the inpainting pipeline on face images to get a negative sample, thus forming a (win, loss) face pair. We can use such a pair in fine-tuning an aesthetic scorer specifically for face quality. With such a metric, we can filter the data to fine-tune T2I diffusion models for better face quality.
  • ...and 8 more figures
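Figure 4's pair construction hinges on the noise strength of the inpainting pipeline: a small strength leaves the face nearly intact, while a larger strength yields a degraded negative. The toy sketch below (not the paper's actual diffusion pipeline; `degrade_face` and the uniform-noise blend are illustrative simplifications) captures that monotone relation between strength and degradation:

```python
import random

def degrade_face(face_pixels, strength, seed=0):
    """Blend each pixel with uniform noise; strength in [0, 1] controls
    degradation. strength=0 returns the original (the 'win' sample);
    larger strengths yield progressively degraded 'loss' samples,
    mimicking how stronger re-noising in inpainting corrupts a face.
    """
    assert 0.0 <= strength <= 1.0
    rng = random.Random(seed)  # fixed seed: same noise field per call
    return [(1 - strength) * p + strength * rng.random() for p in face_pixels]

def mse(a, b):
    """Mean squared error, used here as a crude degradation measure."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
```

Under a fixed noise field, the deviation from the original scales linearly with strength, so the negative sample is strictly "worse" than the positive, which is what makes the (win, loss) ordering reliable without human labels.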