EuroGEST: Investigating gender stereotypes in multilingual language models

Jacqueline Rowe, Mateusz Klimaszewski, Liane Guillou, Shannon Vallor, Alexandra Birch

TL;DR

Larger models encode gendered stereotypes more strongly, and instruction finetuning does not consistently reduce them. The work highlights the need for more multilingual studies of fairness in LLMs and offers scalable methods and resources to audit gender bias across languages.

Abstract

Large language models increasingly support multiple languages, yet most benchmarks for gender bias remain English-centric. We introduce EuroGEST, a dataset designed to measure gender-stereotypical reasoning in LLMs across English and 29 European languages. EuroGEST builds on an existing expert-informed benchmark covering 16 gender stereotypes, expanded in this work using translation tools, quality estimation metrics, and morphological heuristics. Human evaluations confirm that our data generation method results in high accuracy of both translations and gender labels across languages. We use EuroGEST to evaluate 24 multilingual language models from six model families, demonstrating that the strongest stereotypes in all models across all languages are that women are 'beautiful', 'empathetic' and 'neat' and men are 'leaders', 'strong, tough' and 'professional'. We also show that larger models encode gendered stereotypes more strongly and that instruction finetuning does not consistently reduce gendered stereotypes. Our work highlights the need for more multilingual studies of fairness in LLMs and offers scalable methods and resources to audit gender bias across languages.

Paper Structure

This paper contains 30 sections, 2 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: System for translating English GEST sentences into gendered target languages and sorting translated sentences into EuroGEST gendered (EuroGEST$_G$) and EuroGEST neutral (EuroGEST$_N$).
  • Figure 2: Number of sentences in EuroGEST-gendered and EuroGEST-neutral datasets by language.
  • Figure 3: Masculine rank of each stereotype in each official language of the EU in three mid-sized European-centric LLMs. Rank 1 = most strongly associated with masculine gender; Rank 16 = most strongly associated with feminine gender. Red lines divide feminine (top) from masculine (bottom) stereotypes.
  • Figure 4: Divergence of $q_i$ scores for each stereotype from proxy default masculine rate towards stereotypical gender for feminine (top) and masculine (bottom) stereotypes in five sizes of Qwen 2.5 models.
  • Figure 5: Average stereotype rates of base and instruct models across all languages in EuroGEST. A $g_s$ of 1.0 (dotted red line) indicates no stereotyping.
  • ...and 8 more figures