Uncertainty Quantification for Evaluating Machine Translation Bias

Ieva Raminta Staliūnaitė, Julius Cheng, Andreas Vlachos

TL;DR

The paper tackles gender bias in machine translation under input ambiguity by introducing semantic uncertainty as a bias diagnostic. It uses two measures, $\Delta I$ and $\Delta \mathcal{H}$, built on the sampling-based metrics SE, S3E, and GE. It validates that $\Delta I$ correlates with gender accuracy on unambiguous items (notably in WinoMT) and shows that $\Delta \mathcal{H}$ exposes ambiguity-driven bias that interacts with translation quality and language-specific phenomena. The study uses WinoMT, mGeNTe, and human-annotated translations across seven target languages, and evaluates multiple MT models and debiasing strategies. It finds that higher translation accuracy does not universally reduce bias under ambiguity, and that debiasing effects vary by language and by whether the input is ambiguous. These findings highlight the practical value of uncertainty-aware bias evaluation for developing fairer MT systems and for guiding debiasing efforts beyond accuracy metrics alone.

Abstract

The predictive uncertainty of machine translation (MT) models is typically used as a quality estimation proxy. In this work, we posit that apart from confidently translating when a single correct translation exists, models should also maintain uncertainty when the input is ambiguous. We use uncertainty to measure gender bias in MT systems. When the source sentence includes a lexeme whose gender is not overtly marked, but whose target-language equivalent requires gender specification, the model must infer the appropriate gender from the context and can be susceptible to biases. Prior work measured bias via gender accuracy, which cannot be applied to ambiguous cases. Using semantic uncertainty, we are able to assess bias when translating both ambiguous and unambiguous source sentences. We find that high translation accuracy does not correlate with exhibiting uncertainty appropriately, and that debiasing affects the two cases differently.

Paper Structure

This paper contains 30 sections, 8 equations, 2 figures, 20 tables.

Figures (2)

  • Figure 1: Probabilities for feminine and masculine determiners in a Spanish translation of a sentence containing a noun that is either feminine (referred to as 'she') or ambiguous ('they'), by two existing models and the ideal expected attribution of an unbiased model.
  • Figure 2: Violin plots of $\mathcal{H}$ (S3E) on ambiguous and unambiguous WinoMT instances, binned by low (B1), medium (B2), and high (B3) COMET scores against human translations.
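
The intuition behind Figure 1 can be illustrated with a minimal sketch. It assumes that a set of sampled translations has already been labelled by the gender of the determiner, and it computes plain Shannon entropy over those labels. This is a simplified stand-in for illustration only, not the paper's SE, S3E, or GE metrics; the function name `gender_entropy` and the sample lists are hypothetical.

```python
import math
from collections import Counter


def gender_entropy(labels):
    """Shannon entropy (in bits) of the empirical gender distribution
    over a list of sampled-translation labels (hypothetical helper)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical labels for 10 sampled translations of one source sentence:
# "m" = masculine determiner, "f" = feminine determiner.
ambiguous_biased = ["m"] * 9 + ["f"] * 1   # ambiguous source, model favours masculine
ambiguous_ideal = ["m"] * 5 + ["f"] * 5    # ambiguous source, samples split evenly
unambiguous_ideal = ["f"] * 10             # source marked feminine, model is confident

for name, sample in [
    ("ambiguous, biased", ambiguous_biased),
    ("ambiguous, ideal", ambiguous_ideal),
    ("unambiguous, ideal", unambiguous_ideal),
]:
    print(f"{name}: {gender_entropy(sample):.3f} bits")
```

Under these assumptions, an unbiased model would show high entropy on the ambiguous input and near-zero entropy on the unambiguous one, while a biased model would show low entropy in both cases.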