Generated Faces in the Wild: Quantitative Comparison of Stable Diffusion, Midjourney and DALL-E 2

Ali Borji

TL;DR

This paper introduces a framework for fine-grained evaluation of generated faces in cluttered scenes using Fréchet Inception Distance (FID), together with the new GFW dataset of 15,076 generated faces plus 30,000 real faces. By prompting the models with COCO captions and detecting faces in the outputs with MediaPipe, the author compares Stable Diffusion, Midjourney, and DALL-E 2, finding that Stable Diffusion yields the best (lowest) FID while DALL-E 2 lags, partly due to safeguards and data scale. The study highlights the challenges of evaluating generated faces in the wild, including dataset size effects, potential memorization, and watermark presence, and suggests avenues for richer metrics and bias analysis. Overall, the work provides a baseline for future cross-model assessments and motivates larger-scale, more nuanced evaluations of face-generation capabilities in real-world contexts.

Abstract

The field of image synthesis has made great strides in the last couple of years. Recent models are capable of generating images with astonishing quality. Fine-grained evaluation of these models on some interesting categories such as faces is still missing. Here, we conduct a quantitative comparison of three popular systems, Stable Diffusion, Midjourney, and DALL-E 2, in their ability to generate photorealistic faces in the wild. We find that Stable Diffusion generates better faces than the other systems, according to the FID score. We also introduce a dataset of generated faces in the wild dubbed GFW, including a total of 15,076 faces. Furthermore, we hope that our study spurs follow-up research in assessing the generative models and improving them. Data and code are publicly available.

Paper Structure (7 sections, 6 figures)

Figures (6)

  • Figure 1: Left: FID scores of models over random sets of 5000 faces. Right: Results with samples of size 676 per model. Notice that the lower the FID, the better. Results are averaged over 10 runs.
  • Figure 2: Samples of real images (top row) and generated images.
  • Figure 3: Samples of real faces (top row) and generated faces.
  • Figure 4: Low quality faces generated by Stable Diffusion.
  • Figure 5: Low quality faces generated by Midjourney.
  • ...and 1 more figure
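
The Figure 1 caption describes scoring each model on random subsets of a fixed size (5000 or 676 faces) and averaging over 10 runs. The sketch below illustrates that protocol under stated assumptions; it is not the paper's code. FID proper fits Gaussians with full covariance matrices to Inception features; to stay dependency-free, this sketch assumes feature vectors are already extracted and uses a diagonal-covariance simplification of the Fréchet distance. The names gaussian_stats, frechet_distance_diag, and averaged_score are hypothetical, and subsampling the real set as well as the generated set is an assumption the caption does not confirm.

```python
import math
import random
from statistics import mean, pvariance


def gaussian_stats(features):
    """Per-dimension mean and variance of equal-length feature vectors."""
    columns = list(zip(*features))
    return [mean(c) for c in columns], [pvariance(c) for c in columns]


def frechet_distance_diag(real, generated):
    """Frechet distance between two Gaussians with DIAGONAL covariances.

    Simplification (assumption): real FID uses full covariance matrices,
        ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)).
    With diagonal covariances the trace term reduces to
        sum_i (sqrt(var_r_i) - sqrt(var_g_i))^2.
    """
    mu_r, var_r = gaussian_stats(real)
    mu_g, var_g = gaussian_stats(generated)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2
                   for a, b in zip(var_r, var_g))
    return mean_term + cov_term


def averaged_score(real, generated, sample_size, runs=10, seed=0):
    """Average the distance over `runs` random subsets of `sample_size` items.

    Assumption: both the real and the generated set are subsampled.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        real_subset = rng.sample(real, sample_size)
        gen_subset = rng.sample(generated, sample_size)
        scores.append(frechet_distance_diag(real_subset, gen_subset))
    return mean(scores)


if __name__ == "__main__":
    # Toy stand-in features; a real run would use Inception activations
    # of the detected face crops.
    rng = random.Random(1)
    real = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(500)]
    near = [[rng.gauss(0.1, 1.0) for _ in range(8)] for _ in range(500)]
    far = [[rng.gauss(1.0, 2.0) for _ in range(8)] for _ in range(500)]
    print("near:", averaged_score(real, near, sample_size=200))
    print("far: ", averaged_score(real, far, sample_size=200))
```

As with FID, lower is better. Fixing the subset size across models matters because the score depends on sample size, which is why the caption reports the 5000-face and 676-face settings separately.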