Blessing of Multilinguality: A Systematic Analysis of Multilingual In-Context Learning

Yilei Tu, Andrew Xue, Freda Shi

TL;DR

This paper systematically analyzes multilingual in-context learning (ICL) for instruction-tuned LLMs, showing that demonstrations drawn from multiple high-resource languages (HRLs) generally outperform English-only demonstrations, especially for low-resource languages (LRLs). It introduces four prompting modes (English, Monolingual HRL, Multilingual, Native) and performs extensive experiments across MGSM, XCOPA, XL-WiC, and XQuAD, including ablations with non-English context and translation-based baselines. Key findings include robust gains from multilingual prompts, strong but sometimes impractical performance from Native prompts, and additional improvements when even irrelevant non-English sentences are included in prompts. The work also provides neuron-level analysis suggesting overlapping language-specific representations between Multilingual and Native prompting, offering insight into how multilingual exposure enhances cross-lingual transfer. Together, these results advocate for more inclusive multilingual prompting strategies to narrow language resource gaps in LLM capabilities and guide future research on expanding LRL coverage.

Abstract

While multilingual large language models generally perform adequately, and sometimes even rival English performance on high-resource languages (HRLs), they often significantly underperform on low-resource languages (LRLs). Among several prompting strategies aimed at bridging this gap, multilingual in-context learning (ICL) has been particularly effective when demonstrations in the target language are unavailable. However, a systematic understanding of when and why it works well is still lacking. In this work, we systematically analyze multilingual ICL, using demonstrations in HRLs to enhance cross-lingual transfer. We show that demonstrations in mixed HRLs consistently outperform English-only ones across the board, particularly for tasks written in LRLs. Surprisingly, our ablation study shows that the presence of irrelevant non-English sentences in the prompt yields measurable gains, suggesting the effectiveness of multilingual exposure itself. Our results highlight the potential of strategically leveraging multilingual resources to bridge the performance gap for underrepresented languages.

Paper Structure

This paper contains 31 sections, 4 equations, 10 figures, 26 tables.

Figures (10)

  • Figure 1: Illustration of two ICL modes. After providing a few-shot prompt, we evaluate the LLM on the same domain in various languages. In a controlled experiment, each demonstration in (a) and (b) shares the same meaning, albeit in different languages. The contents and languages of the demonstrations are randomly sampled from a training set and a preset high-resource language list, respectively. We find that the Multilingual ICL mode (b) is more effective in helping the LLM solve tasks in different languages than the English ICL mode (a).
  • Figure 2: Illustration of the ICL modes defined by \ref{eq:icl_construction}. Assume $K=3$ and $M=10$. For the second datapoint of the test set (regardless of its language split, e.g., $q_\text{test}$ could be in Thai, Bengali, etc.), we first randomly generate $K=3$ indices from $\left\{1, \cdots, 10\right\}$, say $\left\{8,3,5\right\}$. Next, we determine the languages of the $K=3$ demonstrations. For modes (a), (b), and (d), the language is uniformly specified. For mode (c), we randomly select $K=3$ languages, say $\left\{\text{en, de, fr}\right\}$. Then $\left\{(8, \text{en}), (3, \text{de}), (5, \text{fr})\right\}$ determines each demonstration.
  • Figure 3: Average accuracies of LRLs across three ICL modes on our evaluated $4$ datasets and $7$ MLLMs. Raw accuracies of all language splits are in \ref{tab:vanilla_eval:mgsm}, \ref{tab:vanilla_eval:xcopa}, \ref{tab:vanilla_eval:xlwic}, and \ref{tab:vanilla_eval:xquad} in \ref{app:expt:vanilla}. For simplicity, only the model logos are labeled on the $x$-axis: 3: Llama3-8B-Instruct; 3.1: Llama3.1-8B-Instruct; 2: Qwen2-7B-Instruct; 2.5: Qwen2.5-7B-Instruct; Mistral-NeMo-12B-Instruct (logo only); Aya-Expanse-8b (logo only); 3.5: GPT3.5-turbo; 4om: GPT4o-mini.
  • Figure 4: Monolingual modes vs. the Multilingual mode on average accuracies of LRLs. The $x$-axis is the same as in \ref{fig:vanilla_eval_three_modes}. Raw evaluation accuracies are in \ref{tab:vanilla_eval:mgsm}, \ref{tab:vanilla_eval:xcopa}, \ref{tab:vanilla_eval:xlwic}, and \ref{tab:vanilla_eval:xquad} in \ref{app:expt:vanilla}.
  • Figure 5: Prepending multilingual CIS (CIS-Multi) $\{s_i^{\text{lang}}\}_{i=1}^K \sim \mathcal{S}^{\text{lang}}$ to the demonstrations $\{(q_i^{\text{en}}, a_i)\}_{i=1}^K \sim \mathcal{D}_\text{train}^{\text{en}}$ of the English ICL template illustrated in \ref{fig:icl_modes}a.
  • ...and 5 more figures
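
The demonstration-selection procedure described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sample_demonstrations`, the mode names, the language list `HRLS`, and the default `mono_lang` are all hypothetical, and the caption does not say whether indices or languages are drawn with or without replacement (here indices are distinct and languages are drawn with replacement).

```python
import random

# Hypothetical high-resource language list; the paper's preset list is not given here.
HRLS = ["en", "de", "fr", "es", "zh"]


def sample_demonstrations(mode, K, M, rng, hrls=HRLS, mono_lang="zh", native_lang=None):
    """Return K (index, language) pairs that select demonstrations for one test query."""
    # Draw K distinct 1-based indices from {1, ..., M} (distinctness is an assumption).
    indices = rng.sample(range(1, M + 1), K)
    if mode == "english":
        langs = ["en"] * K
    elif mode == "monolingual":
        # A single non-English HRL shared by all demonstrations.
        langs = [mono_lang] * K
    elif mode == "multilingual":
        # One language per demonstration, drawn with replacement (an assumption).
        langs = [rng.choice(hrls) for _ in range(K)]
    elif mode == "native":
        # All demonstrations use the language of the test query.
        if native_lang is None:
            raise ValueError("native mode needs the test query's language")
        langs = [native_lang] * K
    else:
        raise ValueError(f"unknown mode: {mode}")
    return list(zip(indices, langs))


rng = random.Random(0)  # fixed seed so the draw is reproducible
print(sample_demonstrations("multilingual", K=3, M=10, rng=rng))
```

Each returned pair, such as `(8, "en")`, names one training example and the language split it is taken from, mirroring the caption's $\left\{(8, \text{en}), (3, \text{de}), (5, \text{fr})\right\}$ example. Only the multilingual mode varies the language across demonstrations; the other three modes fix it.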