Analyzing Quality, Bias, and Performance in Text-to-Image Generative Models

Nila Masrourisaadat, Nazanin Sedaghatkish, Fatemeh Sarshartehrani, Edward A. Fox

TL;DR

The paper analyzes the quality, bias, and performance of text-to-image diffusion models on face and motion prompts. Higher-capacity models (e.g., Stable Diffusion) produce higher-fidelity images as measured by FID and R-Precision, yet exhibit gender- and race-related biases on both neutral and stereotype-laden prompts. The paper introduces a data-processing pipeline that uses COCO and Flickr30k to create face- and motion-focused evaluation sets, and it applies both quantitative metrics and a structured qualitative bias analysis with human raters. The key contributions are a bias-analysis framework (176 prompts across race and gender, yielding 2,816 generated images), empirical findings on model biases (e.g., white-male bias under certain prompts, uncertain gender/race in some models), and a comparison of diffusion-model performance across benchmarks. The results underscore the need for careful model selection and bias mitigation when deploying text-to-image systems in sensitive domains, given that even top-performing models can reflect and amplify societal biases.
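
As a rough illustration of the R-Precision metric named above, the sketch below follows the retrieval-style definition commonly used in text-to-image evaluation: an image counts as a hit when its true caption ranks in the top r among a pool of candidate captions by cosine similarity. This is not the authors' implementation. The function names and the toy two-dimensional embeddings are assumptions made for illustration, and a real evaluation would obtain embeddings from a pretrained image-text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def r_precision(image_embs, true_text_embs, distractor_text_embs, r=1):
    """Fraction of images whose true caption ranks in the top r candidates.

    distractor_text_embs[i] is the list of mismatched caption embeddings
    that compete with the true caption of image i (hypothetical layout).
    """
    hits = 0
    for img, true_txt, distractors in zip(
        image_embs, true_text_embs, distractor_text_embs
    ):
        true_score = cosine(img, true_txt)
        # Count distractors that score strictly higher than the true caption.
        better = sum(1 for d in distractors if cosine(img, d) > true_score)
        if better < r:
            hits += 1
    return hits / len(image_embs)


# Toy data (assumed values): the first image matches its caption well,
# the second is closer to one of its distractors.
images = [[1.0, 0.0], [0.0, 1.0]]
captions = [[0.9, 0.1], [1.0, 0.2]]
distractors = [
    [[0.0, 1.0], [0.5, 0.5]],
    [[0.1, 1.0], [1.0, 0.0]],
]
score = r_precision(images, captions, distractors, r=1)
```

With r=1 only the first toy image retrieves its own caption, so the score is 0.5. Raising r relaxes the criterion, which is why reported R-Precision values are only comparable at the same r and candidate-pool size.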

Abstract

Advances in generative models have led to significant interest in image synthesis, demonstrating the ability to generate high-quality images for a diverse range of text prompts. Despite this progress, most studies ignore the presence of bias. In this paper, we examine several text-to-image models not only by qualitatively assessing their performance in generating accurate images of human faces, groups, and specified numbers of objects but also by presenting a social bias analysis. As expected, models with larger capacity generate higher-quality images. However, we also document the inherent gender or social biases these models possess, offering a more complete understanding of their impact and limitations.

Paper Structure (29 sections, 1 equation, 2 figures, 4 tables, 6 algorithms)

Figures (2)

  • Figure 1: Examples of generated face images by each text-to-image model
  • Figure 2: Examples of generated samples with the prompt: "An employee takes time off work to care for sick children at home."