In-Context Example Selection via Similarity Search Improves Low-Resource Machine Translation

Armel Zebaze; Benoît Sagot; Rachel Bawden

TL;DR

This paper systematically investigates how in-context example selection via similarity search affects machine translation with large language models across language directions of varying resource levels. By benchmarking multiple multilingual sentence embeddings and a suite of open-access LLMs, the authors show that similarity-based retrieval yields meaningful MT gains, especially for low-resource languages, and that pool diversity and quality strongly influence results. They compare against BM25 and other baselines, highlighting SONAR as a consistently strong retriever and demonstrating robustness across pool compositions and model scales. The work also discusses prompting challenges in low-resource translation and proposes an evaluation protocol using laCOMET, an adaptation of the COMET metric, to better capture the quality of LLM-based MT, providing a practical framework and code for the community.

Abstract

The ability of generative large language models (LLMs) to perform in-context learning has given rise to a large body of research into how best to prompt models for various natural language processing tasks. In this paper, we focus on machine translation (MT), a task that has been shown to benefit from in-context translation examples. However, no systematic studies have been published on how best to select examples, and mixed results have been reported on the usefulness of similarity-based selection over random selection. We provide a study covering multiple LLMs and multiple in-context example retrieval strategies, comparing multilingual sentence embeddings. We cover several language directions, representing different levels of language resourcedness (English into French, German, Swahili and Wolof). Contrary to previously published results, we find that sentence embedding similarity can improve MT, especially for low-resource language directions, and discuss the balance between selection pool diversity and quality. We also highlight potential problems with the evaluation of LLM-based MT and suggest a more appropriate evaluation protocol, adapting the COMET metric to the evaluation of LLMs. Code and outputs are freely available at https://github.com/ArmelRandy/ICL-MT.

Paper Structure (33 sections, 6 figures, 19 tables)

Introduction
Background and Related Work
In-Context Learning (ICL).
Using LLMs for Machine Translation.
Similarity Search for Example Selection.
Example Retrieval via Similarity Search
Experimental Setup
Datasets
Retrievers
Models
Evaluation metrics
Experiments
Template selection
Benchmarking of example retrieval with multilingual sentence embeddings
Comparing to other approaches
...and 18 more sections

Figures (6)

Figure 1: An overview of example retrieval via similarity search for MT. $k$ sentences are first retrieved from the example pool (parallel corpus) based on their similarity to the source sentence. The retrieved sentence pairs are then assembled (as few-shot examples) with the source sentence into a prompt that is fed to an LLM for translation.
Figure 2: laCOMET scores for example retrieval with SONAR, BM25 and random sampling for various selection pool compositions for eng$\rightarrow$swh and BLOOM 7B1. The triangles correspond to the pool built either by shrinking $\mathcal{P}$ (taking the $N_1$ first pairs) or by extending it (with the $N_2$ first pairs of $\mathcal{U}$). The star indicates the initial pool, i.e. the entire FLORES-200 dev set.
Figure 3: For each pool composition involving FLORES and NLLB samples, the average number of the 10 in-context examples that belong to the FLORES-200 dev set when using SONAR, BM25, and random sampling.
Figure 4: laCOMET scores of example retrieval with SONAR and BM25 compared to random sampling for the $k$-shot setting ($k \in \{1, 2, 5, 10, 20\}$) for eng$\rightarrow$swh and nine LLMs. Note that for readability reasons, the Y-axis scales of the figures are not aligned.
Figure 5: Average number of retrieved examples in common between sentence embedding methods (10-shot).
...and 1 more figures
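
The pipeline described in the Figure 1 caption (retrieve $k$ similar pairs, then assemble them with the source sentence into a prompt) can be illustrated with a minimal sketch. This is not the authors' implementation: the `embed` function is a toy bag-of-words stand-in for a multilingual sentence encoder such as SONAR, and the function names, the example pool, and the prompt template are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def embed(sentence):
    """Toy stand-in for a sentence encoder (assumption): lowercase word counts."""
    return Counter(sentence.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(source, pool, k):
    """Return the k (source, target) pairs whose source side is closest to `source`."""
    query = embed(source)
    ranked = sorted(pool, key=lambda pair: cosine(query, embed(pair[0])), reverse=True)
    return ranked[:k]


def build_prompt(source, examples, src_lang="English", tgt_lang="French"):
    """Assemble the few-shot examples and the source sentence (hypothetical template)."""
    blocks = [f"{src_lang}: {s}\n{tgt_lang}: {t}" for s, t in examples]
    blocks.append(f"{src_lang}: {source}\n{tgt_lang}:")
    return "\n\n".join(blocks)


# Hypothetical example pool (a tiny parallel corpus).
pool = [
    ("The cat sleeps on the sofa.", "Le chat dort sur le canapé."),
    ("I like green tea.", "J'aime le thé vert."),
    ("The dog sleeps on the bed.", "Le chien dort sur le lit."),
]

source = "The cat sleeps on the bed."
examples = retrieve(source, pool, k=2)
prompt = build_prompt(source, examples)
print(prompt)
```

In a realistic setting, the pool embeddings would be computed once with a real encoder and searched with a nearest-neighbour index rather than re-embedded and sorted for every query; the sketch favours brevity over efficiency.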
