What Triggers my Model? Contrastive Explanations Inform Gender Choices by Translation Models

Janiça Hackenbuchner, Arda Tezcan, Joke Daems

TL;DR

The study tackles gender bias in translation by diagnosing how input context triggers gender-inflected outputs using contrastive explanations and saliency attribution on gender-ambiguous data. It translates gender-ambiguous English sentences to German, generates contrastive gender variants, and compares model-salient words with human judgments to reveal shared contextual cues. Results show a high model-human overlap in salient tokens, with nouns and verbs being especially influential and saliency often proximal to the target referent, suggesting common grounding between humans and models in gender decision triggers. The work demonstrates the value of interpretability methods for understanding and potentially mitigating gender bias in MT, while acknowledging limitations from dataset size, language pair, and model scope, and calling for broader analyses across models and languages.

Abstract

Interpretability can serve as a means to understand decisions taken by (black box) models, such as machine translation (MT) systems or large language models (LLMs). Yet research in this area has been limited in relation to a manifest problem in these models: gender bias. With this research, we aim to move away from simply measuring bias to exploring its origins. Working with gender-ambiguous natural source data, this study examines which context, in the form of input tokens in the source sentence, influences (or triggers) the translation model's choice of a certain gender inflection in the target language. To analyse this, we use contrastive explanations and compute saliency attribution. We first address the challenge posed by the lack of a scoring threshold and specifically examine different attribution levels of source words on the model's gender decisions in the translation. We compare salient source words with human perceptions of gender and demonstrate a noticeable overlap between human perceptions and model attribution. Additionally, we provide a linguistic analysis of salient words. Our work showcases the relevance of understanding a model's translation decisions in terms of gender and how these compare to human decisions, and argues that this information should be leveraged to mitigate gender bias.
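
The comparison between salient source words and human annotations described above can be illustrated with a minimal sketch. It is not the authors' implementation. The min-max normalisation, the 0.5 threshold, the overlap definition (share of human-annotated words that the model also marks as salient), the function names, and the example scores are all assumptions made for illustration.

```python
# Illustrative sketch only: names, threshold, and data are hypothetical.

def normalise(scores):
    """Min-max normalise raw attribution scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores identical: no word stands out.
        return {word: 0.0 for word in scores}
    return {word: (s - lo) / (hi - lo) for word, s in scores.items()}


def salient_words(scores, threshold=0.5):
    """Return source words whose normalised score meets the threshold."""
    norm = normalise(scores)
    return {word for word, s in norm.items() if s >= threshold}


def overlap(model_words, human_words):
    """Share of human-annotated words that the model also finds salient."""
    if not human_words:
        return 0.0
    return len(model_words & human_words) / len(human_words)


# Made-up attribution scores for the source words of one sentence.
raw_scores = {"nurse": 0.9, "the": 0.1, "helped": 0.6, "patient": 0.3}
# Made-up human annotations for the same sentence.
human_annotated = {"nurse", "patient"}

model_salient = salient_words(raw_scores, threshold=0.5)
score = overlap(model_salient, human_annotated)
print(sorted(model_salient), score)
```

Varying the threshold in such a sketch changes which words count as salient, which is why the absence of an agreed scoring threshold matters for any model-human comparison.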

Paper Structure

This paper contains 22 sections, 7 figures.

Figures (7)

  • Figure 1: Methodological outline of (i) creating contrastive translations from gender-ambiguous source sentences to (ii) compute attribution scores of input tokens to (iii) compare the most salient source words to human annotations.
  • Figure 2: Example depiction of pre-processed source words and their normalised attribution scores, from highest to lowest. Words in red overlap with human annotations.
  • Figure 3: Highest model-human overlap achieved per approach taken.
  • Figure 4: Comparative scoring of overlapping source words for Approach 4.
  • Figure 5: Approach 4: Comparison between annotations: all vs. min. 2 agree (where source words have been annotated by at least two annotators).
  • ...and 2 more figures