Are All Spanish Doctors Male? Evaluating Gender Bias in German Machine Translation
Michelle Kappl
TL;DR
This work introduces WinoMTDE, a German gender bias evaluation test set (GBET) built on Winograd-schema principles to evaluate gender bias and occupational stereotypes in German MT. It extends a prior English evaluation methodology to German, balancing 288 sentences across gender and stereotype categories and annotating them with occupational stereotypes derived from German labor statistics. The author conducts a large-scale assessment of five MT systems and one LLM (GPT-4o-mini) translating from German into seven gender-marked languages, using a three-stage pipeline (translation, prediction, evaluation) to compute accuracy, $F_1$ scores, and the bias metrics $\Delta_G$ and $\Delta_S$. Results reveal persistent gender bias in most models; the LLM generally outperforms the traditional MT systems but does not eliminate bias, which highlights the need for more equitable translation approaches. The dataset and code are publicly available, providing a foundation for systematic bias evaluation in German MT and for future work on fairness-aware translation.
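The evaluation stage can be illustrated with a minimal sketch. It assumes the metric definitions of the prior methodology (arXiv:1906.00591v1), which this summary does not restate: $\Delta_G$ as the $F_1$ difference between masculine and feminine entities, and $\Delta_S$ as the accuracy difference between pro- and anti-stereotypical sentences. The `Example` record, its field names, and the toy data are hypothetical and are not taken from the released code.

```python
from dataclasses import dataclass


@dataclass
class Example:
    # Hypothetical record for one translated sentence (not the repository's format).
    gold: str        # gold gender of the entity: "male" or "female"
    predicted: str   # gender detected in the translation; may be "unknown"
    stereotype: str  # "pro" or "anti" (stereotype annotation)


def accuracy(examples):
    """Percentage of examples whose predicted gender matches the gold gender."""
    if not examples:
        return 0.0
    correct = sum(e.gold == e.predicted for e in examples)
    return 100.0 * correct / len(examples)


def f1(examples, gender):
    """F1 score (in percent) for one gender class."""
    tp = sum(e.gold == gender and e.predicted == gender for e in examples)
    fp = sum(e.gold != gender and e.predicted == gender for e in examples)
    fn = sum(e.gold == gender and e.predicted != gender for e in examples)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)


def bias_metrics(examples):
    """Accuracy plus the two bias gaps, under the assumed definitions."""
    pro = [e for e in examples if e.stereotype == "pro"]
    anti = [e for e in examples if e.stereotype == "anti"]
    return {
        "acc": accuracy(examples),
        # Assumed: Delta_G = F1(masculine) - F1(feminine)
        "delta_g": f1(examples, "male") - f1(examples, "female"),
        # Assumed: Delta_S = Acc(pro-stereotypical) - Acc(anti-stereotypical)
        "delta_s": accuracy(pro) - accuracy(anti),
    }


# Toy data: one anti-stereotypical female entity is translated as male.
examples = [
    Example("male", "male", "pro"),
    Example("male", "male", "anti"),
    Example("female", "female", "pro"),
    Example("female", "male", "anti"),
]
metrics = bias_metrics(examples)
print(metrics)
```

Under these assumed definitions, positive values of $\Delta_G$ and $\Delta_S$ would indicate that a system handles masculine and pro-stereotypical cases better than feminine and anti-stereotypical ones.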
Abstract
We present WinoMTDE, a new gender bias evaluation test set designed to assess occupational stereotyping and underrepresentation in German machine translation (MT) systems. Building on the automatic evaluation method introduced by arXiv:1906.00591v1, we extend the approach to German, a language with grammatical gender. The WinoMTDE dataset comprises 288 German sentences that are balanced with respect to both gender and stereotype, the latter annotated using German labor statistics. We conduct a large-scale evaluation of five widely used MT systems and a large language model (LLM). Our results reveal persistent bias in most models, with the LLM outperforming the traditional systems. The dataset and evaluation code are publicly available at https://github.com/michellekappl/mt_gender_german.
