Are All Spanish Doctors Male? Evaluating Gender Bias in German Machine Translation

Michelle Kappl

TL;DR

This work introduces WinoMTDE, a German gender bias evaluation test set (GBET) built on Winograd-schema principles to evaluate gender bias and occupational stereotypes in German MT. It extends a prior English evaluation methodology to German, balancing 288 sentences across gender and stereotype categories and annotating them with occupational stereotypes derived from German labor statistics. The authors conduct a large-scale assessment of six MT systems plus GPT-4o-mini translating from German to seven gender-marked languages, using a three-stage pipeline (translation, prediction, evaluation) to compute accuracy (Acc), $F_1$ scores, and the bias metrics $\Delta_G$ and $\Delta_S$. Results reveal persistent gender bias across most models: the LLM generally outperforms traditional MT systems but does not eliminate bias, highlighting the need for more equitable translation approaches. The dataset and code are publicly available, establishing a foundation for systematic bias evaluation in German MT and guiding future improvements in fairness-aware translation.

Abstract

We present WinoMTDE, a new gender bias evaluation test set designed to assess occupational stereotyping and underrepresentation in German machine translation (MT) systems. Building on the automatic evaluation method introduced by arXiv:1906.00591v1, we extend the approach to German, a language with grammatical gender. The WinoMTDE dataset comprises 288 German sentences that are balanced in regard to gender, as well as stereotype, which was annotated using German labor statistics. We conduct a large-scale evaluation of five widely used MT systems and a large language model. Our results reveal persistent bias in most models, with the LLM outperforming traditional systems. The dataset and evaluation code are publicly available under https://github.com/michellekappl/mt_gender_german.
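The metrics named in the TL;DR can be illustrated with a minimal sketch. It assumes the definitions used by the prior evaluation method (arXiv:1906.00591v1): accuracy is the share of translations that preserve the source gender, $\Delta_G$ is the male $F_1$ minus the female $F_1$, and $\Delta_S$ is the accuracy on pro-stereotypical sentences minus the accuracy on anti-stereotypical ones. The `Example` class, its field names, and the label strings are hypothetical and are not taken from the released code.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One evaluated sentence (hypothetical structure, for illustration only)."""
    gold: str        # gender in the German source: "male" or "female"
    pred: str        # gender detected in the translation, e.g. "male", "female", "neutral", "unknown"
    stereotype: str  # "pro" or "anti" (assumed labels)


def accuracy(examples):
    """Percentage of examples whose predicted gender matches the gold gender."""
    if not examples:
        return 0.0
    return 100.0 * sum(e.gold == e.pred for e in examples) / len(examples)


def f1(examples, gender):
    """F1 score (in percent) for one gender class."""
    tp = sum(e.gold == gender and e.pred == gender for e in examples)
    fp = sum(e.gold != gender and e.pred == gender for e in examples)
    fn = sum(e.gold == gender and e.pred != gender for e in examples)
    denom = 2 * tp + fp + fn
    return 100.0 * 2 * tp / denom if denom else 0.0


def bias_metrics(examples):
    """Compute Acc, Delta_G and Delta_S under the assumed definitions."""
    pro = [e for e in examples if e.stereotype == "pro"]
    anti = [e for e in examples if e.stereotype == "anti"]
    return {
        "acc": accuracy(examples),
        "delta_g": f1(examples, "male") - f1(examples, "female"),
        "delta_s": accuracy(pro) - accuracy(anti),
    }


# Toy data: one anti-stereotypical female sentence is mistranslated as male.
toy = [
    Example("male", "male", "pro"),
    Example("male", "male", "anti"),
    Example("female", "female", "pro"),
    Example("female", "male", "anti"),
]
metrics = bias_metrics(toy)
print(metrics)
```

On this toy input, a positive $\Delta_G$ indicates that male entities are translated more reliably than female ones, and a positive $\Delta_S$ indicates that pro-stereotypical sentences are translated more reliably than anti-stereotypical ones; values near zero would indicate less bias under these definitions.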

Paper Structure

This paper contains 20 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Example of gender bias in German Machine Translation by Google Translate, where occupational stereotypes are reinforced.
  • Figure 2: Evaluation pipeline. The German ground truth is shown in orange; the MT model's translation and the corresponding gender and subject predictions are shown in violet.
  • Figure 3: Gender predictions for each occupation group across all languages and MT models were aggregated and visualized. Colors represent professional categories: blue hues for agricultural, manufacturing, and construction; turquoise for sciences, logistics, and security; green for cleaning, tourism, and trade; greenish-yellow for management, office, and HR; yellow for finance and law; orange for healthcare; red for education and social work; and dark red for media, journalism, and design. The x-axis corresponds to the real-world distribution of each occupation group (see the appendix on occupation statistics), ranging from 100% female workers on the left to 50% female (50% male) in the middle, and finally to 0% female (100% male) on the right. The grey vertical line marks occupations with minimal gender imbalance in the real world. The y-axis represents the gender distribution within the translated challenge set. An ideal translation would result in all markers aligning with the green horizontal line, indicating that the original distribution is preserved, as WinoMTDE is balanced in terms of gender and stereotypes.
  • Figure 4: Depiction of the percentage of female (violet), male (orange), neutral (blue), and unknown (light blue) translations across occupations. Dark shades represent correct gender matches, light shades indicate errors. Hatching shows the gender origin within neutral and unknown categories. The horizontal line marks the 50/50 male-female ground truth.