Neutral Prompts, Non-Neutral People: Quantifying Gender and Skin-Tone Bias in Gemini Flash 2.5 Image and GPT Image 1.5

Roberto Balestri

TL;DR

This work shows that neutral prompts do not yield demographically neutral imagery. By generating 3,200 images from four neutral prompts with two commercial image generators and applying an illumination-aware skin-tone analysis across the MST, PERLA, and FST scales, the authors quantify gender, race, and skin-tone defaults and their interactions. They find a pronounced 'default white' bias (>96% of outputs) and model-specific gender defaults (NanoBanana skews female-presenting, while GPT skews male-presenting with lighter skin), with clear prompt–gender–skin-tone interdependencies. The study introduces an illumination-aware auditing framework that separates pigmentation from lighting and post-processing, offering guidance for bias auditing and responsible deployment of generative imagery in diverse global contexts.

Abstract

This study quantifies gender and skin-tone bias in two widely deployed commercial image generators - Gemini Flash 2.5 Image (NanoBanana) and GPT Image 1.5 - to test the assumption that neutral prompts yield demographically neutral outputs. We generated 3,200 photorealistic images using four semantically neutral prompts. The analysis employed a rigorous pipeline combining hybrid color normalization, facial landmark masking, and perceptually uniform skin tone quantification using the Monk (MST), PERLA, and Fitzpatrick scales. Neutral prompts produced highly polarized defaults. Both models exhibited a strong "default white" bias (>96% of outputs). However, they diverged sharply on gender: Gemini favored female-presenting subjects, while GPT favored male-presenting subjects with lighter skin tones. This research provides a large-scale, comparative audit of state-of-the-art models using an illumination-aware colorimetric methodology, distinguishing aesthetic rendering from underlying pigmentation in synthetic imagery. The study demonstrates that neutral prompts function as diagnostic probes rather than neutral instructions. It offers a robust framework for auditing algorithmic visual culture and challenges the sociolinguistic assumption that unmarked language results in inclusive representation.

Paper Structure (41 sections, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Visual demonstration of the landmark-based skin masking pipeline. The pink overlay indicates the valid pixel region retained for colorimetric analysis. Note the precise exclusion of non-skin features (eyes, eyebrows, nostrils, lips) and facial hair regions to prevent color contamination.
  • Figure 2: Percentage of generated subjects classified as men across four neutral prompts. The chart highlights the prompt-dependent variability in GPT (orange) versus the consistently female-skewed output of NanoBanana (blue). Note the sharp inversion for GPT on the prompt "someone," which flips from a male majority to a female majority.
  • Figure 3: Reference color palettes for the three dermatological scales used in this study: Monk Skin Tone (MST), PERLA, and Fitzpatrick Skin Type (FST). Each scale provides a set of reference values against which generated skin tones are mapped using Euclidean distance in CIELAB space.
  • Figure 4: Mean Monk Skin Tone (MST) scores by prompt and model (Scale 1–10, where 10=Darkest). Error bars represent standard deviation.
  • Figure 6: Stacked bar chart of Fitzpatrick Skin Type (FST) distribution by prompt and model. GPT (left bars) is dominated by Types I–II (Blue/Orange), indicating a strong bias toward pale and fair skin. NanoBanana (right bars) shows a broader distribution, extending significantly into Types III (Green) and IV (Red).
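
The Figure 3 caption describes mapping each generated skin tone to a reference scale by Euclidean distance in CIELAB space. The sketch below illustrates that one step only, under stated assumptions: the function names `srgb_to_lab` and `nearest_swatch` are hypothetical, the three-swatch palette is a placeholder rather than the actual MST, PERLA, or FST reference values, and the paper's color normalization and landmark masking are not reproduced. It uses the standard sRGB to CIELAB conversion with a D65 white point.

```python
import math

# D65 reference white (CIE 1931 2-degree observer).
WHITE_D65 = (0.95047, 1.0, 1.08883)

# PLACEHOLDER palette: illustrative sRGB swatches only, NOT the real
# MST / PERLA / FST reference values used in the paper.
PLACEHOLDER_PALETTE = {
    1: (240, 225, 210),  # light
    2: (190, 150, 110),  # medium
    3: (80, 55, 40),     # dark
}


def srgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIELAB (D65)."""
    def linearize(c):
        # Undo the sRGB transfer function (gamma encoding).
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(c) for c in rgb)

    # Linear sRGB -> CIE XYZ.
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t):
        # Piecewise cube-root function from the CIELAB definition.
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = (f(v / w) for v, w in zip((x, y, z), WHITE_D65))
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def nearest_swatch(rgb, palette):
    """Return the palette key whose swatch is closest in CIELAB (Euclidean)."""
    lab = srgb_to_lab(rgb)
    return min(palette, key=lambda k: math.dist(lab, srgb_to_lab(palette[k])))


# Example: classify a mean skin color (hypothetical value) against the palette.
print(nearest_swatch((235, 220, 205), PLACEHOLDER_PALETTE))
```

Plain Euclidean distance in CIELAB corresponds to the CIE76 color difference. Swapping in a real scale would only require replacing the placeholder palette with that scale's published reference colors.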