
Beyond Many-Shot Translation: Scaling In-Context Demonstrations For Low-Resource Machine Translation

Luis Frentzen Salim, Esteban Carlin, Alexandre Morinvil, Xi Ai, Lun-Wei Ku

TL;DR

This study systematically scales in-context demonstrations for low-resource machine translation up to 1M tokens using ultra-long-context decoders, comparing monolingual, instruction-style, and parallel corpora for Javanese and Sundanese. It finds that translation gains saturate well before the maximum context and that performance can degrade near full context, with the magnitude and onset of degradation depending strongly on model architecture and data type. Notably, instruction-style monolingual data can rival parallel data in effectiveness, challenging the assumption that parallel corpora are always superior for in-context learning. The work shows that larger context windows do not guarantee better MT and provides new synthetic parallel corpora to support future research on long-context prompting for low-resource languages.

Abstract

Building machine translation (MT) systems for low-resource languages is notably difficult due to the scarcity of high-quality data. Although Large Language Models (LLMs) have improved MT system performance, adapting them to lesser-represented languages remains challenging. In-context learning (ICL) may offer novel ways to adapt LLMs for low-resource MT by conditioning models on demonstrations at inference time. In this study, we explore scaling low-resource machine translation ICL beyond the few-shot setting to thousands of examples with long-context models. We scale the in-context token budget to 1M tokens and compare three types of training corpora used as in-context supervision: monolingual unsupervised data, instruction-style data, and parallel data (English–target and Indonesian–target). Our experiments on Javanese and Sundanese show that gains from additional context saturate quickly and can degrade near the maximum context window, with scaling behavior strongly dependent on corpus type. Notably, some forms of monolingual supervision can be competitive with parallel data, despite the latter offering additional supervision. Overall, our results characterize the effective limits and corpus-type sensitivity of long-context ICL for low-resource MT, highlighting that larger context windows do not necessarily yield proportional quality gains.

Paper Structure (23 sections, 18 figures, 2 tables)

Figures (18)

  • Figure 1: Main prompt template used for evaluation. Each type of data used as shot samples has its own format, described in later sections.
  • Figure 2: Sample format for each instruction sample. Each sample in the Alpaca dataset contains at least an instruction, an input, and an output. The samples are concatenated and placed in the shot samples section of the main prompt.
  • Figure 3: Sample format for each parallel/supervised sample.
  • Figure 4: ChrF++ scores for two models and two language combinations. Each subplot shows lines indicating the different corpus types used as demonstrations during evaluation. We observe an initial increase in performance, followed by a plateau, then a final drop. The Instruction corpus type (purple) performs comparably to the supervised corpora despite being monolingual, showing that data presentation matters.
  • Figure 5: COMET scores for two models across two language combinations. Each subplot shows lines representing the different corpus types used as demonstrations during evaluation. We observe a pattern similar to the ChrF++ plot: an initial increase in performance, followed by a plateau, then a final drop. The relative ranking of corpus types is consistent with the ChrF++ metric.
  • ...and 13 more figures
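
The captions for Figures 1 to 3 describe a main prompt whose shot samples section is filled with concatenated, formatted samples, scaled up to a token budget. A minimal Python sketch of that packing step follows. It is an illustration under stated assumptions, not the authors' implementation: the field labels, the instruction wording, the helper names (`format_instruction_sample`, `format_parallel_sample`, `pack_shots`, `build_prompt`), and the whitespace-based `count_tokens` stand-in are all hypothetical, since the actual templates appear only in the figures.

```python
from typing import Callable, Dict, List


def format_instruction_sample(sample: Dict[str, str]) -> str:
    """Render one Alpaca-style sample (instruction, input, output).

    The field labels are assumed; the paper's exact format is in its Figure 2.
    """
    return (
        f"Instruction: {sample['instruction']}\n"
        f"Input: {sample.get('input', '')}\n"
        f"Output: {sample['output']}"
    )


def format_parallel_sample(src: str, tgt: str, src_lang: str, tgt_lang: str) -> str:
    """Render one parallel (source, target) pair. The layout is assumed."""
    return f"{src_lang}: {src}\n{tgt_lang}: {tgt}"


def count_tokens(text: str) -> int:
    """Stand-in token counter (whitespace split).

    A real run would use the evaluated model's own tokenizer instead.
    """
    return len(text.split())


def pack_shots(
    shots: List[str],
    budget: int,
    counter: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Keep shots in order until the next one would exceed the token budget."""
    packed: List[str] = []
    used = 0
    for shot in shots:
        cost = counter(shot)
        if used + cost > budget:
            break
        packed.append(shot)
        used += cost
    return packed


def build_prompt(
    shots: List[str],
    source_sentence: str,
    src_lang: str,
    tgt_lang: str,
    budget: int,
) -> str:
    """Concatenate the packed shots, then append the translation query."""
    parts = pack_shots(shots, budget)
    # Hypothetical query wording; the real template is the paper's Figure 1.
    parts.append(
        f"Translate the following {src_lang} sentence into {tgt_lang}.\n"
        f"{src_lang}: {source_sentence}\n"
        f"{tgt_lang}:"
    )
    return "\n\n".join(parts)
```

Sweeping `budget` over increasing values while holding the test sentence fixed would give one point per budget on a scaling curve of the kind shown in Figures 4 and 5. The greedy in-order packing is one simple choice among several; sampling or reordering the shots would be equally valid under this sketch.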