MiTTenS: A Dataset for Evaluating Gender Mistranslation

Kevin Robinson; Sneha Kudugunta; Romina Stella; Sunipa Dev; Jasmijn Bastings

TL;DR

MiTTenS is a dataset for evaluating gender mistranslation that covers 26 languages from a variety of language families and scripts. The authors demonstrate its usefulness by evaluating both neural machine translation systems and foundation models, and show that all systems exhibit gender mistranslation and potential harm, even in high-resource languages.

Abstract

Translation systems, including foundation models capable of translation, can produce errors that result in gender mistranslation, and such errors can be especially harmful. To measure the extent of such potential harms when translating into and out of English, we introduce a dataset, MiTTenS, covering 26 languages from a variety of language families and scripts, including several traditionally under-represented in digital resources. The dataset is constructed with handcrafted passages that target known failure patterns, longer synthetically generated passages, and natural passages sourced from multiple domains. We demonstrate the usefulness of the dataset by evaluating both neural machine translation systems and foundation models, and show that all systems exhibit gender mistranslation and potential harm, even in high resource languages.
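
To make the idea of automated evaluation when translating into English concrete, here is a minimal sketch. It is not the paper's evaluation protocol: the pronoun lists, the function names `is_mistranslated` and `worst_case_accuracy`, and the reading of "worst-case performance" as the minimum accuracy over (language, gender) groups are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical English pronoun lists (an assumption, not taken from the paper).
PRONOUNS = {
    "female": {"she", "her", "hers", "herself"},
    "male": {"he", "him", "his", "himself"},
}


def is_mistranslated(translation: str, expected_gender: str) -> bool:
    """Flag an English translation that contains a pronoun of a gender
    other than the one encoded unambiguously in the source passage."""
    tokens = set(re.findall(r"[a-z]+", translation.lower()))
    wrong = set().union(
        *(words for gender, words in PRONOUNS.items() if gender != expected_gender)
    )
    return bool(tokens & wrong)


def worst_case_accuracy(records):
    """Return the lowest accuracy over (language, gender) groups.

    `records` is an iterable of (language, gender, is_correct) tuples and
    must be non-empty. Treating worst-case performance as a minimum over
    groups is an assumption of this sketch.
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [correct, total]
    for language, gender, is_correct in records:
        totals[(language, gender)][0] += int(is_correct)
        totals[(language, gender)][1] += 1
    return min(correct / total for correct, total in totals.values())
```

A pronoun check of this kind only applies to passages where the source encodes gender unambiguously, and it can miss errors expressed through gendered nouns rather than pronouns.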

Paper Structure (9 sections, 2 figures, 3 tables)

This paper contains 9 sections, 2 figures, 3 tables.

Introduction
Dataset
Gender Sets
SynthBio
Late binding
Encoded in nouns
Evaluation
Conclusion
Evaluation protocol details

Figures (2)

Figure 1: Dataset examples targeting passages where gender mistranslation may occur and cause harm. Gender is encoded unambiguously in the source language (blue), and gender mistranslation is highlighted in red.
Figure 2: Evaluation results using automated evaluation when translating into English. Gemini and PaLM 2 systems perform best when considering worst-case performance, and GPT4 is within 5 percentage points.
