The power of Prompts: Evaluating and Mitigating Gender Bias in MT with LLMs

Aleix Sant; Carlos Escolano; Audrey Mash; Francesca De Luca Fornaciari; Maite Melero

TL;DR

The paper identifies a prompt structure that reduces gender bias by up to 12% on the WinoMT evaluation dataset compared to more straightforward prompts, significantly narrowing the gender bias accuracy gap between LLMs and traditional NMT systems.

Abstract

This paper studies gender bias in machine translation through the lens of Large Language Models (LLMs). Four widely used test sets are employed to benchmark various base LLMs, comparing their translation quality and gender bias against state-of-the-art Neural Machine Translation (NMT) models for the English to Catalan (En $\rightarrow$ Ca) and English to Spanish (En $\rightarrow$ Es) translation directions. Our findings reveal pervasive gender bias across all models, with base LLMs exhibiting a higher degree of bias than NMT models. To combat this bias, we explore prompt engineering techniques applied to an instruction-tuned LLM. We identify a prompt structure that reduces gender bias by up to 12% on the WinoMT evaluation dataset compared to more straightforward prompts. These results significantly narrow the gender bias accuracy gap between LLMs and traditional NMT systems.

Paper Structure (46 sections, 4 figures, 9 tables)

Introduction
Gender Bias Statement
Gender Coreference Resolution
Gender Terms Detection
Related work
Methodology
Models
Llama-2-7B
Ǎguila-7B
Flor-6.3B
M2M-100-1.2B
NLLB-200-1.3B
Mt-aina-en-ca
Google Translate
Llama-2-7B-chat
...and 31 more sections

Figures (4)

Figure 1: Example of Gender Bias in MT
Figure 2: Examples of Gender Coreference Resolution (a) and Gender Terms Detection (b) in En $\rightarrow$ Ca
Figure 3: Male and female predicted terms across models for En $\rightarrow$ Ca in absence of gender cues
Figure 4: Male and female predicted terms across models for En $\rightarrow$ Es in absence of gender cues
