The Impact of Demonstrations on Multilingual In-Context Learning: A Multidimensional Analysis

Miaoran Zhang; Vagrant Gautam; Mingyang Wang; Jesujoba O. Alabi; Xiaoyu Shen; Dietrich Klakow; Marius Mosbach

TL;DR

This work provides a granular, multidimensional analysis of multilingual in-context learning (ICL), showing that demonstrations do not uniformly improve performance across models, tasks, or languages. Evaluating base and chat LLMs on 9 multilingual datasets spanning 56 languages, the study finds that strong instruction-following models are often insensitive to demonstration quality, and that a carefully designed template can remove much of the benefit of demonstrations for some tasks and languages, especially for QA tasks. The findings call for careful, multi-template, language-specific evaluation of multilingual ICL, and suggest that gains attributed to demonstrations may be overstated when they are not compared against robust zero-shot baselines. The work emphasizes the interplay between demonstrations and templates and advises caution when interpreting claims about multilingual ICL in practical applications.

Abstract

In-context learning is a popular inference strategy where large language models solve a task using only a few labeled demonstrations without needing any parameter updates. Although there have been extensive studies on English in-context learning, multilingual in-context learning remains under-explored, and we lack an in-depth understanding of the role of demonstrations in this context. To address this gap, we conduct a multidimensional analysis of multilingual in-context learning, experimenting with 5 models from different model families, 9 datasets covering classification and generation tasks, and 56 typologically diverse languages. Our results reveal that the effectiveness of demonstrations varies significantly across models, tasks, and languages. We also find that strong instruction-following models including Llama 2-Chat, GPT-3.5, and GPT-4 are largely insensitive to the quality of demonstrations. Instead, a carefully crafted template often eliminates the benefits of demonstrations for some tasks and languages altogether. These findings show that the importance of demonstrations might be overestimated. Our work highlights the need for granular evaluation across multiple axes towards a better understanding of in-context learning.
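
The difference between zero-shot and few-shot prompting described above can be illustrated with a minimal sketch. It is not the authors' code: the function name `build_prompt`, the `Text:`/`Label:` pattern, and the example sentences are all hypothetical and chosen only for illustration. The sketch assumes a template made of a task instruction plus an input pattern, with labeled demonstrations optionally placed before the test input.

```python
from typing import List, Tuple


def build_prompt(
    instruction: str,
    demonstrations: List[Tuple[str, str]],
    query: str,
    input_pattern: str = "Text: {text}\nLabel: {label}",
) -> str:
    """Assemble a prompt from a template and k labeled demonstrations.

    An empty demonstration list yields the zero-shot prompt.
    The pattern and field names are illustrative assumptions.
    """
    blocks = [instruction]
    # Each demonstration is formatted with the same pattern as the query.
    for text, label in demonstrations:
        blocks.append(input_pattern.format(text=text, label=label))
    # The query leaves the label empty for the model to complete.
    blocks.append(input_pattern.format(text=query, label="").rstrip())
    return "\n\n".join(blocks)


# Hypothetical sentiment example in Spanish with two demonstrations.
demos = [("Me encanta esta película.", "positive"),
         ("El servicio fue terrible.", "negative")]
zero_shot = build_prompt("Classify the sentiment of the text.", [], "Qué día tan bonito.")
two_shot = build_prompt("Classify the sentiment of the text.", demos, "Qué día tan bonito.")
```

Only the prompt changes between the two settings; the model's parameters stay fixed. Varying the instruction or the input pattern while holding the demonstrations constant is one way to separate template effects from demonstration effects.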

Paper Structure (37 sections, 1 equation, 20 figures, 9 tables)

Introduction
Preliminaries
In-context learning
Multilingual prompting
Experimental setup
Models.
Tasks and datasets.
In-context learning.
Metrics.
Do (more) demonstrations benefit multilingual performance?
Does demonstration quality matter?
Better templates further reduce the benefits of demonstrations
Template design.
Discussion
Understanding the failures of ICL.
...and 22 more sections

Figures (20)

Figure 1: An overview of the components of multilingual in-context learning (see the In-context learning section) with a comparison to zero-shot learning. Sources of variation include tasks, languages, models, and the template, i.e., the task instruction, patterns for formatting inputs, and verbalized labels.
Figure 2: Average performance across languages with different numbers of demonstrations. We average and report standard deviations over 3 seeds for all models except GPT-4. Note that the standard deviations are relatively small, possibly because of averaging over languages. en-xx: translating from English to another language, xx-en: translating from another language to English.
Figure 3: Performance difference between 4-shot and 0-shot. Each marker represents the average performance across models for each language in a given task. MT denotes the MAFAND dataset.
Figure 4: Performance difference between 4-shot and 0-shot for individual languages in PAWS-X. Error bars represent standard deviations calculated over 3 seeds.
Figure 5: Performance of 4-shot ICL using different types of demonstrations for individual languages on AfriSenti and XQuAD. The top row shows Llama 2 results, and the bottom row shows GPT-3.5 results.
...and 15 more figures
