Effective Self-Mining of In-Context Examples for Unsupervised Machine Translation with LLMs

Abdellah El Mekki, Muhammad Abdul-Mageed

TL;DR

The study tackles the scarcity of in-context MT exemplars for multilingual and low-resource languages by introducing a two-stage, unsupervised self-mining pipeline that first derives word-level translations to create synthetic data, then mines sentence-level ICL exemplars using a TopK+BM25 filter. It demonstrates that this unsupervised approach can achieve translation quality on par with, or exceeding, regular in-context learning while outperforming state-of-the-art UMT methods on FLORES-200 directions. The method relies on minimal unlabeled data and utilizes two multilingual LLMs (Llama-3 and Bloom) to show broad applicability across 288 language directions, with performance modulated by language resource level, script, and linguistic distance. These findings suggest a practical path to enabling high-quality MT for under-represented languages without annotated parallel data, while offering insights into the linguistic and model-dependent factors that govern ICL-based MT.

Abstract

Large Language Models (LLMs) have demonstrated impressive performance on a wide range of natural language processing (NLP) tasks, primarily through in-context learning (ICL). In ICL, the LLM is provided with examples that represent a given task so that it learns to generate answers for test inputs. However, access to these in-context examples is not guaranteed, especially for low-resource or massively multilingual tasks. In this work, we propose an unsupervised approach to mine in-context examples for machine translation (MT), enabling unsupervised MT (UMT) across different languages. Our approach begins with word-level mining to acquire word translations, which are then used to perform sentence-level mining. Because the mined parallel pairs may contain noise or mistakes, we introduce a filtering criterion to select the best in-context examples from a pool of unsupervised parallel sentences. We evaluate our approach using two multilingual LLMs on 288 directions from the FLORES-200 dataset and analyze the impact of various linguistic features on performance. Our findings demonstrate the effectiveness of our unsupervised approach in mining in-context examples for MT: it yields translation performance better than or comparable to that of translation with regular in-context samples (extracted from human-annotated data), while also outperforming the other state-of-the-art UMT methods by an average of 7 BLEU points.
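
To make the filtering step more concrete, the following is a minimal sketch of BM25-based selection of in-context examples from a pool of mined sentence pairs. It is an illustration under assumptions, not the authors' implementation: the abstract does not specify the exact TopK+BM25 criterion, so the function names (`bm25_scores`, `select_examples`), the whitespace tokenization, and the parameters `k1`, `b`, and `k` are all hypothetical choices. The sketch only shows the generic idea of ranking mined pairs by the lexical similarity of their source side to the test sentence.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with standard BM25.

    k1 and b are the usual BM25 hyperparameters (values here are assumptions).
    """
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for d in docs_tokens:
        df.update(set(d))

    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            # Smoothed, non-negative IDF variant.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def select_examples(test_source, pool, k=2):
    """Return the k mined pairs whose source side best matches test_source.

    pool is a list of (source_sentence, mined_translation) tuples; the
    translations are assumed to come from an earlier, unsupervised mining step.
    """
    if not pool:
        return []
    query = test_source.lower().split()  # naive whitespace tokenization (assumption)
    docs = [src.lower().split() for src, _ in pool]
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return [pool[i] for i in ranked[:k]]


# Toy pool of mined pairs (invented purely for illustration).
pool = [
    ("the cat sat on the mat", "le chat s'est assis sur le tapis"),
    ("stock markets fell sharply today", "les marchés ont fortement chuté aujourd'hui"),
    ("a dog sat on the rug", "un chien s'est assis sur le tapis"),
]
selected = select_examples("the cat sat quietly", pool, k=2)
```

In this toy run, the pair sharing the most informative source-side terms with the test sentence is ranked first, and the selected pairs would then be formatted as demonstrations in the translation prompt. A real pipeline would likely use a proper tokenizer and may combine BM25 with other quality signals.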

Paper Structure

This paper contains 43 sections, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Mean spBLEU TopK+BM25 scores using Llama-3, showing MT performance across resource levels. (Bloom results are reported in the appendix.)
  • Figure 2: Comparison of the spBLEU score of our unsupervised TopK+BM25 using Llama-3 and Bloom for a randomly selected subset of language pairs.
  • Figure 3: Average spBLEU scores from TopK+BM25 experiments across different resource levels for different language pairs using Bloom. Each cell represents the mean spBLEU score for translations from a source resource level to a target resource level.
  • Figure 4: The impact of multiple iterations of our approach using TopK+BM25 on the spBLEU score for a subset of language pairs using Llama-3.
  • Figure 5: The impact of the quantity of ICL examples on the spBLEU score for UMT employing our TopK+BM25 approach with Llama-3.
  • ...and 2 more figures