
Mind the Language Gap: Automated and Augmented Evaluation of Bias in LLMs for High- and Low-Resource Languages

Alessio Buscemi, Cédric Lothritz, Sergio Morales, Marcos Gomez-Vazquez, Robert Clarisó, Jordi Cabot, German Castignani

TL;DR

This paper tackles the challenge of bias in large language models across multiple languages by introducing MLA-BiTe, a framework that augments the LangBiTe bias-testing workflow with automated translation and paraphrasing to enable scalable, multilingual bias evaluation. The authors evaluate four state-of-the-art LLMs across six languages (including Catalan and Luxembourgish) and seven discrimination categories, demonstrating that paraphrase-before-translation can yield slightly better expansions and that low-resource languages generally exhibit more variability and bias. Through systematic experiments, the work shows that model-language-category interactions drive performance, with English and Spanish providing more stable results and low-resource languages showing greater susceptibility to biases. The study highlights the importance of case-by-case model-language selection for bias detection and points to future directions such as broader language coverage, additional modalities, and culturally aware translation to improve fairness in diverse linguistic communities.

Abstract

Large Language Models (LLMs) have exhibited impressive natural language processing capabilities but often perpetuate social biases inherent in their training data. To address this, we introduce MultiLingual Augmented Bias Testing (MLA-BiTe), a framework that improves prior bias evaluation methods by enabling systematic multilingual bias testing. MLA-BiTe leverages automated translation and paraphrasing techniques to support comprehensive assessments across diverse linguistic settings. In this study, we evaluate the effectiveness of MLA-BiTe by testing four state-of-the-art LLMs in six languages, including two low-resource languages, focusing on seven sensitive categories of discrimination.
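The summary above contrasts paraphrasing before translation with the reverse order. The following is a minimal sketch of how the two orderings differ, assuming that P2T means "paraphrase, then translate" and T2P means "translate, then paraphrase" (the page does not spell the acronyms out). The function names, signatures, and the stub paraphraser and translator are all hypothetical and stand in for LLM calls; this is not the MLA-BiTe implementation.

```python
from typing import Callable, List

# Hypothetical interfaces: a paraphraser returns p variants of a text,
# a translator maps a text into a target language.
Paraphraser = Callable[[str, int], List[str]]
Translator = Callable[[str, str], str]


def p2t(prompt: str, lang: str, p: int,
        paraphrase: Paraphraser, translate: Translator) -> List[str]:
    """Assumed P2T order: paraphrase in the source language, then translate each variant."""
    return [translate(variant, lang) for variant in paraphrase(prompt, p)]


def t2p(prompt: str, lang: str, p: int,
        paraphrase: Paraphraser, translate: Translator) -> List[str]:
    """Assumed T2P order: translate once, then paraphrase in the target language."""
    return paraphrase(translate(prompt, lang), p)


# Deterministic stubs so the sketch runs without any model access.
def fake_paraphrase(text: str, p: int) -> List[str]:
    return [f"{text} (variant {i})" for i in range(1, p + 1)]


def fake_translate(text: str, lang: str) -> str:
    return f"{lang}<{text}>"


if __name__ == "__main__":
    print(p2t("Q", "ca", 2, fake_paraphrase, fake_translate))
    print(t2p("Q", "ca", 2, fake_paraphrase, fake_translate))
```

With the stubs, the nesting of the markers shows which step ran last: in the P2T order the translation wraps each paraphrase, while in the T2P order the paraphrases are derived from a single translated prompt.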

Paper Structure

This paper contains 22 sections, 7 figures, 6 tables, 3 algorithms.

Figures (7)

  • Figure 1: The BLEU scores and cosine similarities for translations between each of the tested languages and the other two, as generated by the selected LLMs.
  • Figure 2: BLEU and cosine similarities for paraphrasing across all the tested languages, with the number of paraphrases $P \in \{2, 5, 10\}$.
  • Figure 3: Distribution of cosine similarity scores for selected translations at $P=5$, used to compare the performance of the two proposed pipelines, P2T and T2P.
  • Figure 4: Each spider plot illustrates the percentage of passed tests for each LLM in one of the seven sensitive categories examined in this paper, spanning all six languages analyzed.
  • Figure 5: Aggregated results by language and model.
  • ...and 2 more figures
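
Several of the captions above report cosine similarity between texts. As a reminder of what that metric computes, here is a minimal sketch in plain Python. The vectors are made-up toy values standing in for sentence embeddings; the embedding model used in the paper is not stated on this page, so nothing here reflects the authors' actual setup.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors (illustrative values only, not real embeddings).
original = [0.2, 0.7, 0.1]
paraphrase = [0.25, 0.65, 0.05]
unrelated = [0.9, -0.1, 0.4]

if __name__ == "__main__":
    print(cosine_similarity(original, paraphrase))
    print(cosine_similarity(original, unrelated))
```

A value near 1 means the two vectors point in nearly the same direction, which is why a close paraphrase or faithful translation is expected to score higher against the original than an unrelated sentence.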