Analyzing Quality, Bias, and Performance in Text-to-Image Generative Models
Nila Masrourisaadat, Nazanin Sedaghatkish, Fatemeh Sarshartehrani, Edward A. Fox
TL;DR
The paper analyzes the quality, bias, and performance of text-to-image diffusion models on face and motion prompts. Higher-capacity models (e.g., Stable Diffusion) produce higher-fidelity images as measured by FID and R-Precision, yet they exhibit gender- and race-related biases on both neutral and stereotype-laden prompts. The paper introduces a data-processing pipeline that uses COCO and Flickr30k to build face- and motion-focused evaluation sets, and it applies both quantitative metrics and a structured qualitative bias analysis with human raters. Key contributions include a comprehensive bias-analysis framework (176 prompts across race and gender, yielding 2,816 generated images), empirical findings on model biases (e.g., a white-male bias under certain prompts, and images of uncertain gender or race from some models), and a nuanced comparison of diffusion-model performance across benchmarks. The results underscore the need for careful model selection and bias mitigation when deploying text-to-image systems in sensitive domains, since even top-performing models can reflect and amplify societal biases.
Abstract
Advances in generative models have led to significant interest in image synthesis, demonstrating the ability to generate high-quality images for a diverse range of text prompts. Despite this progress, most studies ignore the presence of bias. In this paper, we examine several text-to-image models not only by qualitatively assessing their performance in generating accurate images of human faces, groups, and specified numbers of objects but also by presenting a social bias analysis. As expected, models with larger capacity generate higher-quality images. However, we also document the inherent gender or social biases these models possess, offering a more complete understanding of their impact and limitations.
