Understanding LLMs' Cross-Lingual Context Retrieval: How Good It Is And Where It Comes From

Changjiang Gao, Hankun Lin, Xin Huang, Xue Han, Junlan Feng, Chao Deng, Jiajun Chen, Shujian Huang

TL;DR

This paper investigates cross-lingual context retrieval in large language models by evaluating over 40 models across 12 languages on xMRC tasks. It reveals that post-trained open LLMs can approach closed models like GPT-4o, and identifies a two-phase retrieval mechanism: a question-encoding phase shaped by pre-training, followed by an answer-retrieval phase shaped by post-training. Oracle analyses and layer-wise attribution confirm this phasing and show that post-training significantly boosts cross-lingual retrieval potential, while larger-scale pre-training provides limited gains. The findings highlight the critical role of multilingual post-training, especially for smaller models, and offer actionable guidance for improving cross-lingual alignment in multilingual LLMs.

Abstract

Cross-lingual context retrieval (extracting contextual information in one language based on requests in another) is a fundamental aspect of cross-lingual alignment, but its performance and mechanism in large language models (LLMs) remain unclear. In this paper, we evaluate the cross-lingual context retrieval of over 40 LLMs across 12 languages, using cross-lingual machine reading comprehension (xMRC) as a representative scenario. Our results show that post-trained open LLMs exhibit strong cross-lingual context retrieval ability, comparable to closed-source LLMs such as GPT-4o, and that their estimated oracle performance improves greatly after post-training. Our mechanism analysis shows that the cross-lingual context retrieval process can be divided into two main phases, question encoding and answer retrieval, which are formed in pre-training and post-training respectively. The phasing stability correlates with xMRC performance, and the xMRC bottleneck lies in the last model layers of the second phase, where the effect of post-training is clearly observable. Our results also indicate that larger-scale pre-training does not improve xMRC performance. Instead, larger LLMs need further multilingual post-training to fully unlock their cross-lingual context retrieval potential.

Paper Structure

This paper contains 50 sections, 4 equations, 21 figures, 11 tables.

Figures (21)

  • Figure 1: Examples of our en-x and x-x testing scenarios. The figures show examples in English (en), German (de), and Chinese (zh).
  • Figure 2: Illustration of the hypothesized two-phase xMRC process. Across the layers, the last question token is transformed into the first answer token in two phases, with a cross-lingual question representation between them.
  • Figure 3: Mean MRD of the context and question parts for LLaMA-3.1-Instruct-8B.
  • Figure 4: Question, last token and context hidden state similarity between English and other languages in each layer of the LLaMA-3.1-Instruct-8B model on the "balanced" samples.
  • Figure 5: Change in last-input-token hidden state similarity between English and other languages in each layer of LLaMA-3.1-8B, LLaMA-3.1-Instruct-8B and LLaMA-3.1-Tuned-8B on the "balanced" samples.
  • ...and 16 more figures
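
The captions of Figures 4 and 5 refer to per-layer hidden state similarity between English and other languages. The minimal sketch below illustrates one way such a per-layer comparison could be computed. It assumes cosine similarity as the metric and uses random toy vectors in place of real model activations; neither the metric nor the function name `layerwise_cosine_similarity` is taken from the paper, so treat this as an illustration rather than the authors' procedure.

```python
import numpy as np


def layerwise_cosine_similarity(states_a, states_b):
    """Return the cosine similarity per layer for two (num_layers, hidden_dim) arrays.

    Hypothetical helper: the metric used in the paper is not stated in this
    summary, so cosine similarity is an assumption made for illustration.
    """
    a = np.asarray(states_a, dtype=float)
    b = np.asarray(states_b, dtype=float)
    if a.ndim != 2 or a.shape != b.shape:
        raise ValueError("expected two arrays of identical shape (num_layers, hidden_dim)")
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # Guard against division by zero for all-zero hidden states.
    return dots / np.maximum(norms, 1e-12)


# Toy stand-ins for last-token hidden states of the same sample in two languages.
rng = np.random.default_rng(0)
en_states = rng.normal(size=(4, 8))   # 4 layers, hidden size 8
de_states = en_states.copy()
de_states[0] = -en_states[0]          # layer 0: opposite direction
de_states[3] = 2.0 * en_states[3]     # layer 3: same direction, different scale

sims = layerwise_cosine_similarity(en_states, de_states)
for layer, sim in enumerate(sims):
    print(f"layer {layer}: similarity = {sim:.3f}")
```

With real models, the two input arrays would hold one hidden state per layer for the same position (for example, the last input token) in an English sample and in its counterpart in another language, averaged over samples as needed.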