The Impact of Demonstrations on Multilingual In-Context Learning: A Multidimensional Analysis
Miaoran Zhang, Vagrant Gautam, Mingyang Wang, Jesujoba O. Alabi, Xiaoyu Shen, Dietrich Klakow, Marius Mosbach
TL;DR
This work provides a granular, multidimensional analysis of multilingual in-context learning (ICL), showing that demonstrations do not uniformly improve performance across models, tasks, or languages. By evaluating base and chat LLMs on 9 multilingual datasets spanning 56 languages, the study reveals that strong instruction-following models are often insensitive to demonstration quality, and that carefully designed templates can largely obviate the need for demonstrations, especially for QA tasks. The findings call for careful, language-specific evaluation with multiple templates when assessing ICL in multilingual settings, and they show that improvements from demonstrations may be overstated when robust baselines are missing. The work emphasizes the interplay between demonstrations and templates and urges caution when interpreting claims about multilingual ICL in practical applications.
Abstract
In-context learning is a popular inference strategy where large language models solve a task using only a few labeled demonstrations without needing any parameter updates. Although there have been extensive studies on English in-context learning, multilingual in-context learning remains under-explored, and we lack an in-depth understanding of the role of demonstrations in this context. To address this gap, we conduct a multidimensional analysis of multilingual in-context learning, experimenting with 5 models from different model families, 9 datasets covering classification and generation tasks, and 56 typologically diverse languages. Our results reveal that the effectiveness of demonstrations varies significantly across models, tasks, and languages. We also find that strong instruction-following models, including Llama 2-Chat, GPT-3.5, and GPT-4, are largely insensitive to the quality of demonstrations. Moreover, for some tasks and languages, a carefully crafted template often eliminates the benefits of demonstrations altogether. These findings show that the importance of demonstrations might be overestimated. Our work highlights the need for granular evaluation across multiple axes towards a better understanding of in-context learning.
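To make the distinction between demonstrations and templates concrete, the following is a minimal sketch of how a few-shot prompt might be assembled. The template string, the `build_prompt` helper, and the example reviews are all hypothetical illustrations and are not the templates or data used in the paper. Passing an empty demonstration list yields the zero-shot case, where only the template and an optional instruction remain.

```python
from typing import List, Tuple

# Hypothetical template for a sentiment task (an assumption for illustration,
# not one of the templates evaluated in the paper).
TEMPLATE = "Review: {text}\nSentiment: {label}"


def build_prompt(
    demos: List[Tuple[str, str]], query: str, instruction: str = ""
) -> str:
    """Concatenate an optional instruction, k labeled demonstrations,
    and the unlabeled query into a single prompt string."""
    parts = [instruction] if instruction else []
    # Each demonstration is rendered with the same template as the query.
    parts += [TEMPLATE.format(text=text, label=label) for text, label in demos]
    # The query leaves the label slot empty for the model to complete.
    parts.append(TEMPLATE.format(text=query, label="").rstrip())
    return "\n\n".join(parts)


# Few-shot: two labeled demonstrations precede the query.
few_shot = build_prompt(
    [("Great film", "positive"), ("Terrible plot", "negative")],
    "Me encantó",
)

# Zero-shot: no demonstrations, only an instruction and the template.
zero_shot = build_prompt([], "Me encantó", instruction="Classify the sentiment.")
```

Comparing model outputs on `few_shot` and `zero_shot` prompts, across several template variants, is one simple way to check whether demonstrations add anything beyond what the template already provides.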
