
What the Harm? Quantifying the Tangible Impact of Gender Bias in Machine Translation with a Human-centered Study

Beatrice Savoldi, Sara Papi, Matteo Negri, Ana Guerberof, Luisa Bentivogli

TL;DR

We conduct an extensive human-centered study to examine whether and to what extent bias in MT brings harms with tangible costs, such as quality-of-service gaps between women and men, and we advocate for human-centered approaches that can inform the societal impact of bias.

Abstract

Gender bias in machine translation (MT) is recognized as an issue that can harm people and society. And yet, advancements in the field rarely involve people, the final MT users, or inform how they might be impacted by biased technologies. Current evaluations are often restricted to automatic methods, which offer an opaque estimate of what the downstream impact of gender disparities might be. We conduct an extensive human-centered study to examine if and to what extent bias in MT brings harms with tangible costs, such as quality of service gaps across women and men. To this aim, we collect behavioral data from 90 participants, who post-edited MT outputs to ensure correct gender translation. Across multiple datasets, languages, and types of users, our study shows that feminine post-editing demands significantly more technical and temporal effort, also corresponding to higher financial costs. Existing bias measurements, however, fail to reflect the found disparities. Our findings advocate for human-centered approaches that can inform the societal impact of bias.

Paper Structure (55 sections, 11 figures, 8 tables)

Figures (11)

  • Figure 1: Harms as assessed in our study design. We task participants with the post-editing of an MT output into both feminine and masculine gender. We collect behavioural data (i.e. time and technical effort) and assess higher workload and economic costs associated with feminine translations.
  • Figure 2: Human involvement in the assessment and framing of gender (bias) in MT, based on an ACL Anthology search. For studies with human participants, we distinguish qualitative yet model-centric manual evaluation from more human-centric designs, i.e. survey studies and participatory approaches.
  • Figure 3: HTER distribution across post-edited sentences.
  • Figure 4: Seconds per source word distribution across post-edited sentences.
  • Figure 5: Scatter plots with overlaid regression lines of the differences between F and M scores for all datasets, languages and users. Each point represents a sentence-level difference. The correlation between the different metrics is measured with the Pearson $r$ coefficient, and all results are statistically significant (p-value $<0.05$).
  • ...and 6 more figures