Neutral Prompts, Non-Neutral People: Quantifying Gender and Skin-Tone Bias in Gemini Flash 2.5 Image and GPT Image 1.5
Roberto Balestri
TL;DR
This work shows that neutral prompts do not yield demographically neutral imagery. Using 3,200 images generated from four neutral prompts with two commercial image generators, and an illumination-aware skin-tone analysis on the MST, PERLA, and FST scales, the study quantifies gender, race, and skin-tone defaults and their interactions. Both models show a pronounced "default white" bias (>96% of outputs), while gender defaults are model-specific: Gemini Flash 2.5 Image (NanoBanana) favors female-presenting subjects, whereas GPT Image 1.5 favors male-presenting subjects with lighter skin tones. Prompt, gender, and skin tone are also clearly interdependent. The study introduces an illumination-aware auditing framework that separates pigmentation from lighting and post-processing, offering actionable guidance for bias auditing and responsible deployment of generative imagery in diverse global contexts.
Abstract
This study quantifies gender and skin-tone bias in two widely deployed commercial image generators, Gemini Flash 2.5 Image (NanoBanana) and GPT Image 1.5, to test the assumption that neutral prompts yield demographically neutral outputs. We generated 3,200 photorealistic images using four semantically neutral prompts. The analysis pipeline combined hybrid color normalization, facial landmark masking, and perceptually uniform skin-tone quantification on the Monk Skin Tone (MST), PERLA, and Fitzpatrick (FST) scales. Neutral prompts produced highly polarized defaults. Both models exhibited a strong "default white" bias (>96% of outputs), but they diverged sharply on gender: Gemini favored female-presenting subjects, while GPT favored male-presenting subjects with lighter skin tones. This research provides a large-scale, comparative audit of state-of-the-art models using an illumination-aware colorimetric methodology that distinguishes aesthetic rendering from underlying pigmentation in synthetic imagery. The study demonstrates that neutral prompts function as diagnostic probes rather than neutral instructions. It offers a robust framework for auditing algorithmic visual culture and challenges the sociolinguistic assumption that unmarked language results in inclusive representation.
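To make the final quantification step concrete, the sketch below maps a representative skin color to its nearest Monk Skin Tone swatch in CIELAB space. It is an illustration under stated assumptions, not the study's pipeline: it presumes color normalization and landmark masking have already produced a list of skin pixels, it uses plain Euclidean distance in CIELAB (CIE76) as a simple stand-in for whatever perceptual metric the study applies, and the swatch hex values are the commonly published MST values, which should be checked against the official scale. The names `srgb_to_lab`, `representative_rgb`, and `nearest_mst` are hypothetical helpers introduced here.

```python
from statistics import median

# Monk Skin Tone swatches (tone 1 = lightest, 10 = darkest). Hex values as
# commonly published for the scale; an assumption to verify before use.
MST_HEX = ["#f6ede4", "#f3e7db", "#f7ead0", "#eadaba", "#d7bd96",
           "#a07e56", "#825c43", "#604134", "#3a312a", "#292420"]


def hex_to_rgb(h):
    """Parse '#rrggbb' into an (r, g, b) tuple of 0-255 integers."""
    h = h.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))


def srgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIELAB (D65 white point)."""
    def to_linear(c):
        # Undo the sRGB transfer curve to get linear-light values.
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (to_linear(c) for c in rgb)
    # Linear RGB -> XYZ, each axis normalized by the D65 reference white.
    x = (0.4124564 * r + 0.3575761 * g + 0.1804375 * b) / 0.95047
    y = (0.2126729 * r + 0.7151522 * g + 0.0721750 * b) / 1.00000
    z = (0.0193339 * r + 0.1191920 * g + 0.9503041 * b) / 1.08883

    def f(t):
        d = 6 / 29
        return t ** (1 / 3) if t > d ** 3 else t / (3 * d * d) + 4 / 29

    fx, fy, fz = f(x), f(y), f(z)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


MST_LAB = [srgb_to_lab(hex_to_rgb(h)) for h in MST_HEX]


def representative_rgb(pixels):
    """Per-channel median of masked skin pixels (robust to highlights)."""
    return tuple(median(channel) for channel in zip(*pixels))


def nearest_mst(rgb):
    """Return the 1-based MST tone whose swatch is closest in CIELAB."""
    lab = srgb_to_lab(rgb)
    dists = [sum((p - q) ** 2 for p, q in zip(lab, ref)) for ref in MST_LAB]
    return dists.index(min(dists)) + 1
```

A call such as `nearest_mst(representative_rgb(skin_pixels))`, where `skin_pixels` is a hypothetical list of `(r, g, b)` tuples from the masked face region, would yield a tone index from 1 to 10. Without illumination correction, a bright key light or heavy post-processing can shift that index independently of pigmentation, which is the confound the study's illumination-aware approach is meant to address.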
