Understanding LLMs' Cross-Lingual Context Retrieval: How Good It Is And Where It Comes From
Changjiang Gao, Hankun Lin, Xin Huang, Xue Han, Junlan Feng, Chao Deng, Jiajun Chen, Shujian Huang
TL;DR
This paper investigates cross-lingual context retrieval in large language models (LLMs) by evaluating over 40 models across 12 languages on cross-lingual machine reading comprehension (xMRC) tasks. It shows that post-trained open LLMs can approach closed models such as GPT-4o, and identifies a two-phase retrieval mechanism: a question-encoding phase shaped by pre-training, followed by an answer-retrieval phase shaped by post-training. Oracle analyses and layer-wise attribution support this phasing and show that post-training significantly boosts cross-lingual retrieval potential, while larger-scale pre-training provides limited gains. The findings highlight the critical role of multilingual post-training, especially for smaller models, and offer actionable guidance for improving cross-lingual alignment in multilingual LLMs.
Abstract
Cross-lingual context retrieval (extracting contextual information in one language based on requests in another) is a fundamental aspect of cross-lingual alignment, but its performance and mechanism in large language models (LLMs) remain unclear. In this paper, we evaluate the cross-lingual context retrieval of over 40 LLMs across 12 languages, using cross-lingual machine reading comprehension (xMRC) as a representative scenario. Our results show that post-trained open LLMs have strong cross-lingual context retrieval ability, comparable to closed-source LLMs such as GPT-4o, and that their estimated oracle performance improves greatly after post-training. Our mechanism analysis shows that the cross-lingual context retrieval process can be divided into two main phases, question encoding and answer retrieval, which are formed in pre-training and post-training respectively. The stability of this phasing correlates with xMRC performance, and the xMRC bottleneck lies in the last model layers of the second phase, where the effect of post-training is clearly observable. Our results also indicate that larger-scale pre-training cannot improve xMRC performance. Instead, larger LLMs need further multilingual post-training to fully unlock their cross-lingual context retrieval potential.
