The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora
Chen Amiraz, Yaroslav Fyodorov, Elad Haramaty, Zohar Karnin, Liane Lewin-Eytan
TL;DR
The paper investigates cross-lingual retrieval biases in Arabic-English retrieval-augmented generation (RAG) over domain-specific corpora. It introduces two UAE-based bilingual benchmarks (Legal and Travel) and an evaluation pipeline that uses dense multilingual retrievers, a re-ranker, and an LLM-based judge to measure end-to-end accuracy. The key finding is that retrieval, especially ranking documents across languages, is the main bottleneck: accuracy drops substantially when the user query and the supporting document are in different languages, compared to same-language cases. The authors show that simple mitigations, such as retrieving equally from both languages or translating the query, yield meaningful improvements, and they suggest that targeted cross-lingual retriever training could further close the gap for practical multilingual RAG applications.
Abstract
Cross-lingual retrieval-augmented generation (RAG) is a critical capability for retrieving and generating answers across languages. Prior work in this context has mostly focused on generation and relied on benchmarks derived from open-domain sources, most notably Wikipedia. In such settings, retrieval challenges often remain hidden due to language imbalances, overlap with pretraining data, and memorized content. To address this gap, we study Arabic-English RAG in a domain-specific setting using benchmarks derived from real-world corporate datasets. Our benchmarks include all combinations of languages for the user query and the supporting document, drawn independently and uniformly at random. This enables a systematic study of multilingual retrieval behavior. Our findings reveal that retrieval is a critical bottleneck in cross-lingual domain-specific scenarios, with substantial performance drops occurring when the user query and supporting document languages differ. A key insight is that these failures stem primarily from the retriever's difficulty in ranking documents across languages. Finally, we propose two simple retrieval strategies that address this source of failure by enforcing equal retrieval from both languages or by translating the query, resulting in substantial improvements in cross-lingual and overall performance. These results highlight meaningful opportunities for improving multilingual retrieval, particularly in practical, real-world RAG applications.
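The first of the two strategies, enforcing equal retrieval from both languages, can be illustrated with a minimal sketch. It assumes the retriever returns scored candidates tagged with a language label; the names `Hit` and `balanced_top_k`, the language codes, and the fallback used when one language has too few candidates are hypothetical choices for illustration, not the authors' implementation.

```python
from typing import NamedTuple, Sequence


class Hit(NamedTuple):
    """A retrieved candidate (hypothetical structure for illustration)."""
    doc_id: str
    lang: str     # language label, e.g. "en" or "ar"
    score: float  # retriever similarity; higher is better


def balanced_top_k(hits: Sequence[Hit], k: int,
                   langs: Sequence[str] = ("en", "ar")) -> list[Hit]:
    """Select k hits, taking an equal share from each language."""
    per_lang = k // len(langs)
    selected: list[Hit] = []
    for lang in langs:
        # Rank candidates within a single language only, so scores are
        # never compared across languages at this step.
        in_lang = sorted((h for h in hits if h.lang == lang),
                         key=lambda h: h.score, reverse=True)
        selected.extend(in_lang[:per_lang])

    # Assumption: leftover slots (odd k, or a language with too few
    # candidates) are filled with the best remaining hits by score.
    chosen = {h.doc_id for h in selected}
    rest = sorted((h for h in hits if h.doc_id not in chosen),
                  key=lambda h: h.score, reverse=True)
    selected.extend(rest[:k - len(selected)])
    return sorted(selected, key=lambda h: h.score, reverse=True)


hits = [
    Hit("e1", "en", 0.9), Hit("e2", "en", 0.8), Hit("e3", "en", 0.7),
    Hit("a1", "ar", 0.4), Hit("a2", "ar", 0.3),
]
print([h.doc_id for h in balanced_top_k(hits, k=4)])
```

In this toy input, a plain top-4 by score would keep three English documents and one Arabic one; the balanced selection keeps two from each language, so a relevant document in the lower-scoring language is less likely to be crowded out.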
