Blessing of Multilinguality: A Systematic Analysis of Multilingual In-Context Learning
Yilei Tu, Andrew Xue, Freda Shi
TL;DR
This paper systematically analyzes multilingual in-context learning (ICL) for instruction-tuned LLMs, showing that demonstrations drawn from multiple high-resource languages (HRLs) generally outperform English-only demonstrations, especially for low-resource languages (LRLs). It introduces four prompting modes (English, Monolingual HRL, Multilingual, Native) and performs extensive experiments across MGSM, XCOPA, XL-WiC, and XQuAD, including ablations with non-English context and translation-based baselines. Key findings include robust gains from multilingual prompts, strong performance from Native prompts (which are not always practical, since demonstrations in the target language may be unavailable), and additional improvements when even irrelevant non-English sentences are included in prompts. The work also provides neuron-level analysis suggesting overlapping language-specific representations between Multilingual and Native prompting, offering insight into how multilingual exposure enhances cross-lingual transfer. Together, these results advocate for more inclusive multilingual prompting strategies to narrow language resource gaps in LLM capabilities and guide future research on expanding LRL coverage.
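To make the four prompting modes concrete, the following is a minimal sketch of how the demonstrations for each mode might be selected and assembled into a prompt. It is an illustration under stated assumptions, not the authors' implementation: the function names `pick_languages` and `build_prompt`, the `Question:`/`Answer:` template, the round-robin mixing rule for the Multilingual mode, the choice of `zh` as the single HRL, and the inclusion of English in the HRL list are all hypothetical. The demonstration pool holds placeholder strings rather than real benchmark items, and it assumes the same demonstrations are available in every language.

```python
from typing import Dict, List, Tuple

Demo = Tuple[str, str]  # (question, answer) pair


def pick_languages(mode: str, k: int, target_lang: str,
                   hrls: List[str], mono_hrl: str = "zh") -> List[str]:
    """Return the language used for each of the k demonstration slots."""
    if mode == "english":
        return ["en"] * k
    if mode == "monolingual_hrl":
        # All demonstrations in one non-English HRL (hypothetical default: zh).
        return [mono_hrl] * k
    if mode == "multilingual":
        # Assumption: cycle through the HRL list; random sampling would also work.
        return [hrls[i % len(hrls)] for i in range(k)]
    if mode == "native":
        # All demonstrations in the language of the test query.
        return [target_lang] * k
    raise ValueError(f"unknown mode: {mode}")


def build_prompt(mode: str, query: str, target_lang: str,
                 pool: Dict[str, List[Demo]], hrls: List[str],
                 k: int = 3, mono_hrl: str = "zh") -> str:
    """Assemble a few-shot prompt; pool[lang][i] is demo i written in lang."""
    langs = pick_languages(mode, k, target_lang, hrls, mono_hrl)
    blocks = []
    for i, lang in enumerate(langs):
        question, answer = pool[lang][i]
        blocks.append(f"Question: {question}\nAnswer: {answer}")
    # The test query comes last, with the answer left open for the model.
    blocks.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(blocks)


# Placeholder parallel pool: the same 3 demos "translated" into each language.
HRLS = ["en", "zh", "es"]
POOL: Dict[str, List[Demo]] = {
    lang: [(f"<{lang} question {i}>", f"<answer {i}>") for i in range(3)]
    for lang in HRLS + ["sw"]
}

# Example: a Swahili query with demonstrations mixed across HRLs.
print(build_prompt("multilingual", "<sw test question>", "sw", POOL, HRLS))
```

Only the language of the demonstrations changes between modes; the test query stays in the target language throughout, which is what lets the modes be compared on the same inputs.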
Abstract
While multilingual large language models generally perform adequately, and sometimes even rival English performance on high-resource languages (HRLs), they often significantly underperform on low-resource languages (LRLs). Among several prompting strategies aimed at bridging this gap, multilingual in-context learning (ICL) has been particularly effective when demonstrations in the target language are unavailable. However, a systematic understanding of when and why it works well is still lacking. In this work, we systematically analyze multilingual ICL, using demonstrations in HRLs to enhance cross-lingual transfer. We show that demonstrations in mixed HRLs consistently outperform English-only ones, particularly for tasks written in LRLs. Surprisingly, our ablation study shows that the presence of irrelevant non-English sentences in the prompt yields measurable gains, suggesting the effectiveness of multilingual exposure itself. Our results highlight the potential of strategically leveraging multilingual resources to bridge the performance gap for underrepresented languages.
