Gender Trouble in Language Models: An Empirical Audit Guided by Gender Performativity Theory

Franziska Sofia Hafner, Ana Valdivia, Luc Rocher

TL;DR

This work addresses how language models encode gender beyond mere stereotypes by applying gender performativity theory to audit the construction of gender in LMs. It introduces a theory-grounded framework that tests 16 open-source LMs using prompts and three log probability ratio probes to assess sex–gender associations and the pathologization of transgender and gender-diverse identities. The findings show that larger models increasingly encode a binary, biologically tied concept of gender, embed nonbinary terms poorly, and attach transgender/gender-diverse identities to mental illness, underscoring the need for theory-informed evaluation and careful debiasing. The study advocates rethinking how gender harms are defined and mitigated in LMs, emphasizes the downstream risks in deployment (notably in healthcare), and calls for interdisciplinary collaboration to broaden the imagined domain of gender in model development and evaluation.

Abstract

Language models encode and subsequently perpetuate harmful gendered stereotypes. Research has succeeded in mitigating some of these harms, e.g. by dissociating non-gendered terms such as occupations from gendered terms such as 'woman' and 'man'. This approach, however, remains superficial given that associations are only one form of prejudice through which gendered harms arise. Critical scholarship on gender, such as gender performativity theory, emphasizes how harms often arise from the construction of gender itself, such as conflating gender with biological sex. In language models, these issues could lead to the erasure of transgender and gender diverse identities and cause harms in downstream applications, from misgendering users to misdiagnosing patients based on wrong assumptions about their anatomy. For FAccT research on gendered harms to go beyond superficial linguistic associations, we advocate for a broader definition of 'gender bias' in language models. We operationalize insights on the construction of gender through language from gender studies literature and then empirically test how 16 language models of different architectures, training datasets, and model sizes encode gender. We find that language models tend to encode gender as a binary category tied to biological sex, and that gendered terms that do not neatly fall into one of these binary categories are erased and pathologized. Finally, we show that larger models, which achieve better results on performance benchmarks, learn stronger associations between gender and sex, further reinforcing a narrow understanding of gender. Our findings lead us to call for a re-evaluation of how gendered harms in language models are defined and addressed.

Paper Structure

This paper contains 23 sections, 5 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Probability of gender-related predictions by sex-related context. The figure shows how language models assign gender labels depending on context information about sex characteristics. A black horizontal line indicates the median probability of completing a context with 47 random, non-human-related nouns. See \ref{fig:gender_word_probabilities_test_appendix} for smaller model results.
  • Figure 2: Alignment of Models with Folk Understanding of Gender by Model Size. This figure shows the extent to which language models associate male sex characteristics with men, and female sex characteristics with women. Larger models (by number of parameters) tend to have stronger associations than smaller ones ($\rho = 0.89$, $p < 0.01$, Spearman rank correlation). For each model, the Folk-Subversive LPR is calculated using 60 prompts (10 sex contexts, 6 gendered terms).
  • Figure 3: Alignment of gendered terms with male vs. female sex characteristics. The figure shows how language models associate gendered terms with specific sex characteristics. There is a clear pattern of associating 'a man' more with male characteristics, and 'a woman' more with female characteristics. See \ref{fig:log_prob_ratio_male_vs_female_trans_enby_appendix} for smaller model results.
  • Figure 4: Distribution of Gender–Illness Log Probability Ratio per Gender Context. The figure shows whether illness-related predictions become more (>0) or less (<0) probable conditional on the gendered context 'a woman', 'nonbinary', 'transgender', 'genderqueer', 'genderfluid', or 'two-spirit' compared to 'a man'. Overlapping distributions for mental and physical illnesses suggest similar associations for both illness types, whereas mental illness distributions skewing further right (as observed for all models in the 'nonbinary person' context, for example) indicate a stronger likelihood of predicting mental rather than physical illness in these contexts. Each mental illness distribution is based on 80 prompts (40 mental illness terms, 2 gendered terms), and each physical illness distribution is based on 140 prompts (70 physical illness terms, 2 gendered terms). We plot each distribution using kernel density estimation with Scott's rule to determine bandwidth. We report significance levels in the top-right corner of each panel (* $p < 0.05$, ** $p < 0.01$, *** $p < 0.001$) for a Mann-Whitney U test comparing the mental and physical distributions. See \ref{fig:mental_and_physicall_illness_by_gender_dist_appendix} for smaller model results.
  • Figure 5: Illnesses Most and Least Associated with Gender Contexts. Each panel shows the three illnesses most associated (>0) and least associated (<0) with a gender context compared to 'a man'. We abbreviate 'disorder' as 'dis.'. See \ref{fig:mental_and_physicall_illness_by_gender_dist_appendix} for smaller model results.
  • ...and 4 more figures
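
The captions for Figures 4 and 5 describe a log probability ratio that is positive when an illness term is more probable in a gender context than in the 'a man' baseline, and a Mann-Whitney U test comparing the mental and physical illness distributions. The following is a minimal sketch of those two quantities, not the authors' implementation. It assumes the ratio is log(P(term | context) / P(term | 'a man')). The function names and all probabilities are hypothetical, made up for illustration, and the U statistic is computed without a p-value.

```python
import math


def log_prob_ratio(p_context: float, p_baseline: float) -> float:
    """Return log(p_context / p_baseline); > 0 means more probable in context."""
    if p_context <= 0 or p_baseline <= 0:
        raise ValueError("probabilities must be positive")
    return math.log(p_context) - math.log(p_baseline)


def mann_whitney_u(xs, ys):
    """U statistic for xs: count of (x, y) pairs with x > y, ties counted as 0.5."""
    return sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in xs
        for y in ys
    )


# Hypothetical model probabilities for two placeholder illness terms.
# These numbers are invented for illustration; they are not results from the paper.
baseline = {"illness_a": 0.010, "illness_b": 0.020}  # context: 'a man'
context = {"illness_a": 0.020, "illness_b": 0.010}   # some other gender context

lprs = {
    term: log_prob_ratio(context[term], baseline[term])
    for term in baseline
}

for term, value in lprs.items():
    direction = "more" if value > 0 else "less"
    print(f"{term}: LPR = {value:+.3f} ({direction} probable than baseline)")
```

In a full analysis, one list of ratios per illness type would be passed to `mann_whitney_u` (or to a statistics library that also returns a p-value) to compare the mental and physical distributions for each gender context.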