Beyond the Prompt: Gender Bias in Text-to-Image Models, with a Case Study on Hospital Professions
Franck Vandewiele, Remi Synave, Samuel Delepoulle, Remi Cozot
TL;DR
This study examines gender bias in six open-weight text-to-image models by generating images of five hospital professions under varied portrait qualifiers. Using a unified prompting framework and manual gender annotation, the authors find consistent stereotypes (nurses depicted exclusively as women, surgeons predominantly as men) alongside model-specific differences in prompt sensitivity and bias strength. Prompt wording modulates gender balance: the qualifier "corporate" pushes outputs toward male depictions and "beautiful" toward female ones. The authors therefore argue for bias-aware design, balanced defaults, and user guidance in generative AI. They also call for broader mitigation strategies and for extending the analysis to intersectional dimensions, so that professional imagery is represented fairly and diversely.
Abstract
Text-to-image (TTI) models are increasingly used in professional, educational, and creative contexts, yet their outputs often embed and amplify social biases. This paper investigates gender representation in six state-of-the-art open-weight models: HunyuanImage 2.1, HiDream-I1-dev, Qwen-Image, FLUX.1-dev, Stable-Diffusion 3.5 Large (SD3.5), and Stable-Diffusion-XL (SDXL). Using carefully designed prompts, we generated 100 images for each combination of five hospital-related professions (cardiologist, hospital director, nurse, paramedic, surgeon) and five portrait qualifiers ("" (i.e., no qualifier), "corporate", "neutral", "aesthetic", "beautiful"). Our analysis reveals systematic occupational stereotypes: all models produced nurses exclusively as women and surgeons predominantly as men. However, differences emerge across models: Qwen-Image and SDXL enforce rigid male dominance, HiDream-I1-dev shows mixed outcomes, and FLUX.1-dev skews female in most roles. HunyuanImage 2.1 and SD3.5 also reproduce gender stereotypes, but with varying degrees of sensitivity to prompt formulation. Portrait qualifiers further modulate gender balance, with terms like "corporate" reinforcing male depictions and "beautiful" favoring female ones. Sensitivity varies widely: Qwen-Image remains nearly unaffected, while FLUX.1-dev, SDXL, and SD3.5 show strong prompt dependence. These findings demonstrate that gender bias in TTI models is both systematic and model-specific. Beyond documenting disparities, we argue that prompt wording plays a critical role in shaping demographic outcomes. The results underscore the need for bias-aware design, balanced defaults, and user guidance to prevent the reinforcement of occupational stereotypes in generative AI.
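To make the scale of the experimental design concrete, the following is a minimal Python sketch of the grid described in the abstract: six models, five professions, five qualifiers, and 100 images per combination. The prompt template in `build_prompt`, the label strings "female" and "male", and all function names are illustrative assumptions; they are not the authors' actual prompts or annotation scheme.

```python
from itertools import product

# Names taken from the abstract.
MODELS = [
    "HunyuanImage 2.1", "HiDream-I1-dev", "Qwen-Image",
    "FLUX.1-dev", "Stable-Diffusion 3.5 Large", "Stable-Diffusion-XL",
]
PROFESSIONS = ["cardiologist", "hospital director", "nurse", "paramedic", "surgeon"]
QUALIFIERS = ["", "corporate", "neutral", "aesthetic", "beautiful"]  # "" = no qualifier
IMAGES_PER_CELL = 100


def build_prompt(profession, qualifier):
    # ASSUMPTION: hypothetical template; the paper's real wording is not given here.
    parts = [qualifier, "portrait of a", profession]
    return " ".join(p for p in parts if p)


def design_grid():
    # One entry per (model, profession, qualifier) combination.
    return [
        (model, profession, qualifier, build_prompt(profession, qualifier))
        for model, profession, qualifier in product(MODELS, PROFESSIONS, QUALIFIERS)
    ]


def female_share(labels):
    # ASSUMPTION: manual annotations are stored as the strings "female" / "male";
    # any other label (e.g. an unclear image) is left out of the denominator.
    counted = [label for label in labels if label in ("female", "male")]
    if not counted:
        return None
    return counted.count("female") / len(counted)


grid = design_grid()
total_images = len(grid) * IMAGES_PER_CELL  # 6 * 5 * 5 * 100
```

Under these assumptions the grid has 150 model-profession-qualifier cells, which corresponds to 15,000 generated images in total; `female_share` would then be applied to the 100 annotations in each cell to compare gender balance across models and qualifiers.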
