Women Are Beautiful, Men Are Leaders: Gender Stereotypes in Machine Translation and Language Modeling

Matúš Pikuliak, Andrea Hrckova, Stefan Oresko, Marián Šimko

TL;DR

GEST is a new manually created dataset for measuring gender-stereotypical reasoning in language models and machine translation systems. It contains samples for 16 gender stereotypes about men and women that are compatible with English and 9 Slavic languages.

Abstract

We present GEST -- a new manually created dataset designed to measure gender-stereotypical reasoning in language models and machine translation systems. GEST contains samples for 16 gender stereotypes about men and women (e.g., Women are beautiful, Men are leaders) that are compatible with the English language and 9 Slavic languages. The definition of said stereotypes was informed by gender experts. We used GEST to evaluate English and Slavic masked LMs, English generative LMs, and machine translation systems. We discovered significant and consistent amounts of gender-stereotypical reasoning in almost all the evaluated models and languages. Our experiments confirm the previously postulated hypothesis that the larger the model, the more stereotypical it usually is.

Paper Structure

This paper contains 76 sections, 5 equations, 14 figures, 11 tables.

Figures (14)

  • Figure 1: Basic overview of how we use one sample to test four different types of NLP systems. For all systems, we observe the grammatical gender (either feminine or masculine) of the predictions when the model is exposed to a stereotypical sentence. Other Slavic languages are used in the same way as Slovak is in this example.
  • Figure 2: Comparison of the global masculine rate $f_m$ and the stereotype rate $f_s$ for MT systems and target languages.
  • Figure 3: Boxplots for the feminine ranks of the stereotypes across all system-language pairs we evaluated in the MT experiment.
  • Figure 4: Stereotype rates $g_s$ for English MLMs and GLMs. GLMs are color-coded based on their family. The average score across all compatible templates is reported.
  • Figure 5: Boxplots for the feminine ranks of the stereotypes across all model-template pairs we evaluated in the experiment with English MLMs.
  • ...and 9 more figures