An Empirical Study of In-context Learning in LLMs for Machine Translation

Pranjal A. Chitale, Jay Gala, Raj Dabre

TL;DR

This study interrogates how in-context learning operates for machine translation by distinguishing between demonstration-driven and instruction-driven effects and by systematically perturbing both prompts and demonstrations. Through a broad empirical campaign spanning 6 models, 12 language pairs, and over 25K runs, it isolates factors such as target versus source distribution, spatial proximity, and alignment, revealing that demonstrations largely govern MT performance while instructions play a secondary role. Key findings show that the target distribution and the proximity of clean demonstrations to the test example are especially influential, that perturbations to demonstrations often hurt but can sometimes regularize model behavior, and that demonstrations from allied tasks can substitute for exact task demonstrations. The work offers actionable guidance for practitioners on selecting demonstrations, highlights robustness concerns for in-context learning, and provides a public resource to benchmark and extend ICL research in MT.

Abstract

Recent interest has surged in employing Large Language Models (LLMs) for machine translation (MT) via in-context learning (ICL) (Vilar et al., 2023). Most prior studies focus primarily on optimizing translation quality, with limited attention to understanding the specific aspects of ICL that influence that quality. To this end, we perform an exhaustive study of in-context learning for machine translation, the first of its kind. We first establish that ICL is primarily example-driven and not instruction-driven. Following this, we conduct an extensive exploration of various aspects of the examples to understand their influence on downstream performance. Our analysis includes factors such as quality and quantity of demonstrations, spatial proximity, and source versus target originality. We also investigate challenging scenarios involving indirectness and misalignment of examples to understand the limits of ICL. While we establish that the quality of the target distribution of demonstrations matters more than that of the source distribution, we further observe that perturbations sometimes act as regularizers, resulting in performance improvements. Surprisingly, ICL does not necessitate examples from the same task, and a related task with the same target distribution proves sufficient. We hope that our study acts as a guiding resource for considerations in utilizing ICL for MT. Our code is available at https://github.com/PranjalChitale/in-context-mt-analysis.
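To make the terminology concrete, the following is a minimal sketch of how a k-shot prompt might be assembled from demonstrations, and how a word duplication perturbation could be applied to one side of those demonstrations. The prompt template, the function names (`build_prompt`, `duplicate_words`, `perturb_demos`), and the exact noise procedure are illustrative assumptions, not the authors' implementation; the linked repository holds the actual code.

```python
import random


def build_prompt(demos, test_source, src_lang, tgt_lang, instruction=None):
    """Assemble a k-shot MT prompt (hypothetical template, not the paper's)."""
    lines = [instruction] if instruction else []
    for src, tgt in demos:
        lines += [f"{src_lang}: {src}", f"{tgt_lang}: {tgt}"]
    # The test example comes last, with the target side left open.
    lines += [f"{src_lang}: {test_source}", f"{tgt_lang}:"]
    return "\n".join(lines)


def duplicate_words(sentence, delta, rng):
    """Repeat roughly a fraction `delta` of the words in place (assumed procedure)."""
    words = sentence.split()
    k = min(len(words), round(delta * len(words)))
    chosen = set(rng.sample(range(len(words)), k))
    out = []
    for i, word in enumerate(words):
        out.append(word)
        if i in chosen:
            out.append(word)  # duplicated copy sits right after the original
    return " ".join(out)


def perturb_demos(demos, delta, side="target", seed=0):
    """Perturb only the source or only the target side of each demonstration."""
    rng = random.Random(seed)
    noisy = []
    for src, tgt in demos:
        if side == "source":
            src = duplicate_words(src, delta, rng)
        else:
            tgt = duplicate_words(tgt, delta, rng)
        noisy.append((src, tgt))
    return noisy


demos = [
    ("Good morning", "Bonjour"),
    ("Thank you very much", "Merci beaucoup à vous"),
]
clean_prompt = build_prompt(demos, "See you soon", "English", "French")
noisy_prompt = build_prompt(
    perturb_demos(demos, delta=0.5, side="target"),
    "See you soon", "English", "French",
)
```

Passing `instruction=None` gives an example-only prompt, which is one way to separate the effect of demonstrations from that of instructions; switching `side` between `"source"` and `"target"` mirrors the source-versus-target comparison described in the abstract.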

Paper Structure (26 sections, 18 figures, 4 tables)

Figures (18)

  • Figure 1: Comparison of ChrF++ scores for En-XX (top) and XX-En (bottom) across different instruction types for BLOOM and Llama 2 model families (averaged across different languages and shots).
  • Figure 2: Mean percent change in ChrF++ score for each attack relative to the zero-noise baseline, for each model, across both translation directions (En-XX and XX-En) and both perturbation directions. Scores are averaged across shots and noise percentages ($\delta$). "Span" denotes span noise, "order" the word order attack, "dup" the word duplication attack, "add" the punctuation addition attack, and "drop" the punctuation removal attack. Positive values indicate that performance decreased after perturbation, while negative values indicate that it increased. Note: In certain cases, scores are bounded within minimum and maximum values for trend clarification.
  • Figure 3: Mean percent change in ChrF++ score across different heterogeneous cases relative to the zero-noise baseline, for each model, across both translation directions (En-XX and XX-En) and both perturbation directions. Scores are averaged across attack types, shots, and noise percentages ($\delta$). Positive values indicate that performance decreased after perturbation, while negative values indicate that it increased. Note: In certain cases, scores are bounded within minimum and maximum values for clarity in depicting overarching trends.
  • Figure 4: ChrF++ score of different models averaged across shots and translation directions comparing the choice of the auxiliary source language of the demonstration.
  • Figure 5: ChrF++ score of different models, averaged across translation directions, comparing different forms of misalignment in a pivot-translation setting.
  • ...and 13 more figures
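
The sign convention in the captions of Figures 2 and 3 (positive means performance dropped) can be illustrated with a small sketch. The formula below is one reading of "percent change relative to the zero-noise baseline" that is consistent with that convention; the function names and the clipping bounds are assumptions, since the captions do not state the bounds used.

```python
from statistics import mean


def percent_drop(baseline, perturbed):
    """Percent change relative to the clean baseline.

    Positive when the perturbed score is lower than the baseline,
    negative when the perturbation improved the score.
    """
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return 100.0 * (baseline - perturbed) / baseline


def mean_clipped_drop(baseline, perturbed_scores, lo=-100.0, hi=100.0):
    """Average the per-setting drops after bounding each one to [lo, hi].

    The default bounds are placeholders, not values from the paper.
    """
    drops = [min(hi, max(lo, percent_drop(baseline, s))) for s in perturbed_scores]
    return mean(drops)


# Hypothetical ChrF++ scores: clean baseline 40.0, two perturbed settings.
print(percent_drop(40.0, 30.0))               # 25.0 (performance decreased)
print(percent_drop(40.0, 44.0))               # -10.0 (performance increased)
print(mean_clipped_drop(40.0, [30.0, 44.0]))  # 7.5
```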