Linguistic Nepotism: Trading-off Quality for Language Preference in Multilingual RAG

Dayeon Ki, Marine Carpuat, Paul McNamee, Daniel Khashabi, Eugene Yang, Dawn Lawrie, Kevin Duh

TL;DR

The paper investigates language-driven biases in multilingual retrieval-augmented generation (mRAG) by establishing a controlled framework to measure language preference in citations across eight languages and six open-weight models. It combines human validation, machine translation quality assessment, logit-lens analysis, and attribution analysis (ContextCite) to demonstrate a robust English citation bias that intensifies for lower-resource languages and mid-context documents, and that can trade off relevance for language. The findings indicate that language preference shapes not only surface citations but also contributive attribution, even when translation quality is strong or when the query language is non-English. These results highlight a practical need to debias or calibrate multilingual retrieval and citation mechanisms to ensure faithful cross-lingual information access in mRAG systems.

Abstract

Multilingual Retrieval-Augmented Generation (mRAG) systems enable language models to answer knowledge-intensive queries with citation-supported responses across languages. While such systems have been proposed, an open question is whether the mixture of different document languages impacts generation and citation in unintended ways. To investigate, we introduce a controlled methodology using model internals to measure language preference while holding other factors such as document relevance constant. Across eight languages and six open-weight models, we find that models preferentially cite English sources when queries are in English, with this bias amplified for lower-resource languages and for documents positioned mid-context. Crucially, we find that models sometimes trade off document relevance for language preference, indicating that citation choices are not always driven by informativeness alone. Our findings shed light on how language models leverage multilingual context and influence citation behavior.

Paper Structure

This paper contains 12 sections, 1 equation, 12 figures, 7 tables.

Figures (12)

  • Figure 9: Full instructions and example provided to human annotators. The annotation task was hosted on a custom-built website. Annotators first viewed a brief task instruction (a), then evaluated 30 statements, with an example shown in (b).
  • Figure 10: Rating distribution for each label group. We plot the distribution of 180 judgments collected during human annotation (90 supported and 90 unsupported statements). Results show that annotators can reliably distinguish supported from unsupported statements based on their ratings.
  • Figure 11: COMET-QE score distributions by language. Distributions are more skewed for shorter content (e.g., title) and broader for longer content (e.g., evidence document).
  • Figure 12: Accuracy difference between English and each target language binned by relative position. Each bin is normalized by sample size.
  • Figure 13: Logit lens visualization per language for LLaMA-3.3 70B (80 layers). $x$-axis: Last layer index; $y$-axis: Statement count. We show the last 40 layers. ●: Correct citation ID of document in target language; ✕: Wrong citation ID of document in English; ▲: Not in valid citation set.
  • ...and 7 more figures