Applying Intrinsic Debiasing on Downstream Tasks: Challenges and Considerations for Machine Translation
Bar Iluz, Yanai Elazar, Asaf Yehudai, Gabriel Stanovsky
TL;DR
The paper investigates how intrinsic debiasing methods affect extrinsic bias in machine translation, and shows that downstream fairness depends on design choices: which tokens to debias, where in the model to apply debiasing, and the target language. It systematically evaluates Hard-Debiasing, INLP, and LEACE across encoder and decoder embedding tables and three choices of which tokens to debias (all tokens, n-token-profession, and 1-token-profession) for German, Hebrew, and Russian, using WinoMT accuracy and BLEU as core metrics. Key findings: 1-token-profession debiasing often yields the best gender translation accuracy, the optimal embedding table depends on the debiasing method, and target-language morphology significantly affects outcomes; LEACE and Hard-Debiasing preserve BLEU better than INLP. The work offers practical guidance for integrating intrinsic debiasing into MT and highlights the need for broader evaluation across tasks and languages to achieve extrinsically fairer systems.
Abstract
Most work on gender bias focuses on intrinsic bias: removing traces of information about a protected group from the model's internal representations. However, this work is often disconnected from the impact of such debiasing on downstream applications, which is the main motivation for debiasing in the first place. In this work, we systematically test how methods for intrinsic debiasing affect neural machine translation models by measuring the extrinsic bias of such systems under different design choices. We highlight three challenges and mismatches between the debiasing techniques and their end-goal usage: the choice of which embeddings to debias, the mismatch between word-level debiasing and sub-word tokens, and the effect on different target languages. We find that these considerations have a significant impact on downstream performance and on the success of debiasing.
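
To make the design choices above concrete, the following is a minimal illustrative sketch, not the paper's implementation. It removes a single linear "gender direction" from selected rows of one embedding table, in the spirit of projection-based methods such as Hard-Debiasing. Everything here is an assumption made for illustration: the toy vocabulary, the random embeddings, the helper names `gender_direction` and `project_out`, and the estimate of the direction from one definitional pair (a simplification of the PCA step used in Hard-Debiasing). Only professions that map to a single token are debiased, mirroring the 1-token-profession setting.

```python
import numpy as np

def gender_direction(emb, vocab, pairs):
    """Unit vector from the mean difference of definitional pairs (a simplification)."""
    diffs = [emb[vocab[m]] - emb[vocab[f]] for m, f in pairs]
    d = np.mean(diffs, axis=0)
    return d / np.linalg.norm(d)

def project_out(emb, rows, direction):
    """Return a copy of emb in which the selected rows have no component along direction."""
    out = emb.copy()
    selected = out[rows]
    # Subtract each selected row's projection onto the (unit-norm) direction.
    out[rows] = selected - np.outer(selected @ direction, direction)
    return out

# Hypothetical toy vocabulary: "doctor" and "nurse" are single-token professions,
# while "ph" + "##ysician" stands in for a profession split into sub-word tokens.
vocab = {"he": 0, "she": 1, "doctor": 2, "nurse": 3, "the": 4, "ph": 5, "##ysician": 6}
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(vocab), 8))  # stand-in for an encoder or decoder embedding table

direction = gender_direction(emb, vocab, [("he", "she")])
single_token_professions = [vocab["doctor"], vocab["nurse"]]
debiased = project_out(emb, single_token_professions, direction)
```

In this sketch the multi-token profession and all non-profession tokens are left untouched; choosing a different row set (all tokens, or every sub-word of a profession) or a different embedding table corresponds to the other design choices discussed above.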
