
A Prompt Response to the Demand for Automatic Gender-Neutral Translation

Beatrice Savoldi, Andrea Piergentili, Dennis Fucci, Matteo Negri, Luisa Bentivogli

TL;DR

This study investigates automating gender-neutral translation (GNT) by comparing traditional MT systems with GPT-4 on English→Italian, using GeNTE as a test bed. In the baseline zero-shot setting, both MT and GPT struggle to produce neutral translations. Targeted prompting of GPT-4 with three prompt templates and in-domain exemplars yields substantial neutralization (roughly 65–70% neutral outputs), although judgments of acceptability remain notably subjective. The authors provide a fine-grained manual evaluation framework and release annotations that capture the variability in what constitutes a good GNT, highlighting both the promise and the challenges of using instruction-following LLMs for GNT. The work underscores the potential of GPT prompting for inclusive translation, acknowledges limitations such as language scope and reproducibility, and points to future work with open-source models and broader datasets.

Abstract

Gender-neutral translation (GNT) that avoids biased and undue binary assumptions is a pivotal challenge for the creation of more inclusive translation technologies. Advancements for this task in Machine Translation (MT), however, are hindered by the lack of dedicated parallel data, which are necessary to adapt MT systems to satisfy neutral constraints. For such a scenario, large language models offer hitherto unforeseen possibilities, as they come with the distinct advantage of being versatile in various (sub)tasks when provided with explicit instructions. In this paper, we explore this potential to automate GNT by comparing MT with the popular GPT-4 model. Through extensive manual analyses, our study empirically reveals the inherent limitations of current MT systems in generating GNTs and provides valuable insights into the potential and challenges associated with prompting for neutrality.

Paper Structure

This paper contains 22 sections, 2 figures, and 7 tables.

Figures (2)

  • Figure 1: Manual Evaluation Results.
  • Figure 2: Neutrality for the Baseline and the GNT-Prompting settings evaluated by the classifier.