Beyond Content: How Grammatical Gender Shapes Visual Representation in Text-to-Image Models
Muhammed Saeed, Shaina Raza, Ashmal Vayani, Muhammad Abdul-Mageed, Ali Emami, Shady Shehata
TL;DR
GramVis investigates whether grammatical gender shapes visual representations in multilingual text-to-image (T2I) models. It finds a robust male-leaning bias driven by masculine grammatical cues and weaker, more variable effects for feminine cues. The approach uses a cross-linguistic dataset of 800 prompts built on gender-divergent words across seven languages and evaluates three state-of-the-art T2I models, generating $28{,}800$ images under controlled prompt templates. Language resource availability and model architecture systematically modulate these effects, with high-resource languages and Flux-like models exhibiting stronger associations. The work shows that language structure itself biases AI-generated visuals, offering a new dimension for assessing and mitigating bias in multilingual multimodal AI.
Abstract
Research on bias in Text-to-Image (T2I) models has primarily focused on demographic representation and stereotypical attributes, overlooking a fundamental question: how does grammatical gender influence visual representation across languages? We introduce a cross-linguistic benchmark examining words whose grammatical gender contradicts their stereotypical gender association (e.g., ``une sentinelle'', grammatically feminine in French but referring to the stereotypically masculine concept ``guard''). Our dataset spans five gendered languages (French, Spanish, German, Italian, Russian) and two gender-neutral control languages (English, Chinese), comprising 800 unique prompts from which we generated 28,800 images across three state-of-the-art T2I models. Our analysis reveals that grammatical gender strongly influences image generation: masculine grammatical markers raise male representation to 73% on average (compared to 22% with gender-neutral English), while feminine grammatical markers raise female representation to 38% (compared to 28% in English). These effects vary systematically by language resource availability and model architecture, with high-resource languages showing stronger effects. Our findings establish that language structure itself, not just content, shapes AI-generated visual outputs, introducing a new dimension for understanding bias and fairness in multilingual, multimodal systems.
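The headline figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes that every prompt is rendered by every model the same number of times (the abstract does not state the sampling scheme), and the names `shift_in_points`, `masculine_shift`, and `feminine_shift` are hypothetical helpers, not part of the benchmark.

```python
# Figures quoted in the abstract.
TOTAL_IMAGES = 28_800
NUM_PROMPTS = 800
NUM_MODELS = 3

# Assumption (not stated in the abstract): each prompt is rendered by each
# model the same number of times, so the total divides evenly.
images_per_prompt_per_model, remainder = divmod(
    TOTAL_IMAGES, NUM_PROMPTS * NUM_MODELS
)


def shift_in_points(gendered_pct: int, english_pct: int) -> int:
    """Percentage-point change relative to the gender-neutral English baseline."""
    return gendered_pct - english_pct


# Male representation: 73% under masculine markers vs. 22% in English.
masculine_shift = shift_in_points(73, 22)
# Female representation: 38% under feminine markers vs. 28% in English.
feminine_shift = shift_in_points(38, 28)

print(f"images per prompt per model: {images_per_prompt_per_model}")
print(f"masculine shift: +{masculine_shift} points")
print(f"feminine shift: +{feminine_shift} points")
```

Under that uniform-sampling assumption the total divides evenly, and the percentage-point differences make the asymmetry between masculine and feminine cues explicit: the masculine shift is several times larger than the feminine one.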
