Generated Faces in the Wild: Quantitative Comparison of Stable Diffusion, Midjourney and DALL-E 2
Ali Borji
TL;DR
This paper introduces a framework for fine-grained evaluation of generated faces in cluttered scenes, based on the Fréchet Inception Distance (FID) and a new dataset, GFW, of 15,076 generated faces plus 30,000 real faces. By prompting the models with COCO captions and detecting faces in the outputs with MediaPipe, the author compares Stable Diffusion, Midjourney, and DALL-E 2, finding that Stable Diffusion yields the best FID while DALL-E 2 lags, partly due to safeguards and data scale. The study highlights the challenges of evaluating generated faces in the wild, including dataset size effects, potential memorization, and watermark presence, and suggests avenues for richer metrics and bias analysis. Overall, the work provides a baseline for future cross-model assessments and motivates larger-scale, more nuanced evaluations of face-generation capabilities in real-world contexts.
Abstract
The field of image synthesis has made great strides in the last couple of years, and recent models are capable of generating images of astonishing quality. However, fine-grained evaluation of these models on interesting categories such as faces is still missing. Here, we conduct a quantitative comparison of three popular systems, Stable Diffusion, Midjourney, and DALL-E 2, in their ability to generate photorealistic faces in the wild. We find that Stable Diffusion generates better faces than the other systems, according to the FID score. We also introduce a dataset of generated faces in the wild, dubbed GFW, comprising a total of 15,076 faces. We hope that our study spurs follow-up research in assessing generative models and improving them. Data and code are available.
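For reference, the FID score used for the comparison is commonly defined as the Fréchet distance between two Gaussians fitted to Inception features of the real and generated images. The notation below is an assumption for illustration and is not taken from the paper: $\mu_r, \Sigma_r$ denote the mean and covariance of the features of real faces, and $\mu_g, \Sigma_g$ those of generated faces.

```latex
\[
\mathrm{FID}
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g
  - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
\]
```

A lower FID means the feature statistics of the generated faces are closer to those of the real faces, which is the sense in which one system generates "better" faces here.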
