Ready to Translate, Not to Represent? Bias and Performance Gaps in Multilingual LLMs Across Language Families and Domains

Md. Faiyaz Abdullah Sayeedi, Md. Mahbub Alam, Subhey Sadi Rahman, Md. Adnanul Islam, Jannatul Ferdous Deepti, Tasnim Mohiuddin, Md Mofijul Islam, Swakkhar Shatabda

TL;DR

Translation Tangles introduces a unified framework and dataset to jointly evaluate translation quality and fairness across 24 bidirectional language pairs spanning diverse language families and domains. The methodology combines a multilingual benchmark, a hybrid bias-detection pipeline (semantic similarity, NER-based flagging, and keyword matching), and an LLM-based bias validation mechanism, complemented by a human-annotated bias dataset. Key findings show that larger models reduce, but do not erase, cross-family translation gaps; that domain-specific translation remains challenging; and that cultural and sociocultural biases are prevalent and unevenly distributed across languages. The work provides a resource for benchmarking fairness in open-source LLMs and offers insights for building more equitable, domain-aware translation systems, with code and data released on GitHub.

Abstract

The rise of Large Language Models (LLMs) has redefined Machine Translation (MT), enabling context-aware and fluent translations across hundreds of languages and textual domains. Despite their remarkable capabilities, LLMs often exhibit uneven performance across language families and specialized domains. Moreover, recent evidence reveals that these models can encode and amplify different biases present in their training data, posing serious concerns for fairness, especially in low-resource languages. To address these gaps, we introduce Translation Tangles, a unified framework and dataset for evaluating the translation quality and fairness of open-source LLMs. Our approach benchmarks 24 bidirectional language pairs across multiple domains using different metrics. We further propose a hybrid bias detection pipeline that integrates rule-based heuristics, semantic similarity filtering, and LLM-based validation. We also introduce a high-quality, bias-annotated dataset based on human evaluations of 1,439 translation-reference pairs. The code and dataset are accessible on GitHub: https://github.com/faiyazabdullah/TranslationTangles
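The hybrid pipeline described above can be pictured as a set of cheap filters that run before the more expensive LLM validation. The following is a minimal sketch under stated assumptions, not the authors' implementation: token-overlap (Jaccard) similarity stands in for embedding-based semantic similarity, a capitalized-word heuristic stands in for NER, the `GENDER_TERMS` lexicon and all function names are hypothetical, and the direction of the threshold comparison is assumed. The LLM validation step is omitted.

```python
import re

# Hypothetical lexicon; the paper's actual keyword lists are not reproduced here.
GENDER_TERMS = {"he", "she", "him", "her", "his", "hers", "man", "woman"}


def tokens(text):
    """Return lowercased word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def jaccard(a, b):
    """Token-overlap similarity in [0, 1]; a stand-in for embedding similarity."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def entities(text):
    """Crude stand-in for NER: capitalized words other than the first word."""
    words = [w.strip(".,;:!?\"'") for w in text.split()]
    return {w for w in words[1:] if w[:1].isupper()}


def flag_pair(translation, reference, tau=0.75):
    """Collect rule-based signals for one translation-reference pair."""
    sim = jaccard(translation, reference)
    reasons = []
    # Keyword matching: gendered terms differ between translation and reference.
    if GENDER_TERMS & set(tokens(translation)) != GENDER_TERMS & set(tokens(reference)):
        reasons.append("gender_keyword_mismatch")
    # Entity flagging: named entities differ between the two texts.
    if entities(translation) != entities(reference):
        reasons.append("entity_mismatch")
    # Assumption: a pair becomes a candidate only if a rule fired AND the
    # similarity falls below tau. The source does not state this direction.
    return {
        "similarity": sim,
        "reasons": reasons,
        "candidate": bool(reasons) and sim < tau,
    }


if __name__ == "__main__":
    result = flag_pair(
        "The doctor said he would visit Dhaka tomorrow.",
        "The doctor said she would visit Dhaka tomorrow.",
        tau=0.8,
    )
    print(result)
```

In a fuller pipeline, only pairs marked as candidates would be passed on to LLM-based validation and human annotation, which keeps the expensive steps focused on a small subset of the data.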

Paper Structure

This paper contains 46 sections, 6 equations, 25 figures, and 9 tables.

Figures (25)

  • Figure 1: Our framework evaluates performance gaps and potential biases in translations generated by different LLMs by comparing T (Translation) with R (Reference) and validating the results through LLMs and human annotators.
  • Figure 2: Total biases are plotted across thresholds from 0.6 to 0.95. The count stabilizes beyond $\tau = 0.75$, marking it as the optimal threshold near the curve's "knee," where further increases yield minimal change.
  • Figure 3: Bias heatmaps for translation outputs. (Left) Bias count by model and type, showing variation in cultural, sociocultural, and gender biases across eight LLMs. (Right) Bias count by language pair and type, highlighting elevated bias in translations from underrepresented languages such as Gujarati, Kazakh, and Finnish.
  • Figure 4: Raw Bias Counts Across Similarity Thresholds for Each Bias Category
  • Figure 5: Normalized Bias Detection Rates Across Similarity Thresholds for Each Bias Type
  • ...and 20 more figures
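
The threshold selection described in the Figure 2 caption (the bias count stabilizing beyond $\tau = 0.75$) can be illustrated with a simple stabilization check. This is a minimal sketch, not the authors' procedure: the function name `stable_threshold`, the relative tolerance `tol`, and the stabilization rule itself are assumptions, and the counts in the usage example are synthetic values chosen for illustration, not results from the paper.

```python
def stable_threshold(curve, tol=0.05):
    """Return the smallest threshold after which every step changes the count
    by at most `tol` relative to the previous count, or None if none exists.

    `curve` is an iterable of (threshold, count) pairs.
    """
    curve = sorted(curve)
    for i, (tau, _) in enumerate(curve[:-1]):
        tail = curve[i:]
        # Compare each count with the next one from this threshold onward.
        steps = zip(tail, tail[1:])
        if all(abs(b - a) <= tol * max(a, 1) for (_, a), (_, b) in steps):
            return tau
    return None


if __name__ == "__main__":
    # Synthetic counts for illustration only; NOT taken from the paper.
    synthetic = [
        (0.60, 400), (0.65, 310), (0.70, 250), (0.75, 220),
        (0.80, 216), (0.85, 214), (0.90, 213), (0.95, 213),
    ]
    print(stable_threshold(synthetic))
```

A tolerance-based rule like this is only one way to locate a knee; curvature-based methods are an alternative when the curve is noisy.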