Effective Self-Mining of In-Context Examples for Unsupervised Machine Translation with LLMs

Abdellah El Mekki; Muhammad Abdul-Mageed

TL;DR

The study addresses the scarcity of in-context MT examples in low-resource and massively multilingual settings. It introduces a two-stage, unsupervised self-mining pipeline: the first stage derives word-level translations and uses them to create synthetic parallel data, and the second mines sentence-level ICL examples and selects among them with a TopK+BM25 filter. On FLORES-200 directions, this unsupervised approach achieves translation quality on par with, or better than, in-context learning with human-annotated examples, and it outperforms state-of-the-art UMT methods. The method needs only a small amount of unlabeled data and is evaluated with two multilingual LLMs (Llama-3 and Bloom) across 288 language directions, where performance varies with language resource level, script, and linguistic distance. These findings suggest a practical path to high-quality MT for under-represented languages without annotated parallel data, and they offer insight into the linguistic and model-dependent factors that govern ICL-based MT.

Abstract

Large Language Models (LLMs) have demonstrated impressive performance on a wide range of natural language processing (NLP) tasks, primarily through in-context learning (ICL). In ICL, the LLM is given examples that represent a task so that it learns to generate answers for test inputs. However, access to these in-context examples is not guaranteed, especially for low-resource or massively multilingual tasks. In this work, we propose an unsupervised approach to mine in-context examples for machine translation (MT), enabling unsupervised MT (UMT) across different languages. Our approach begins with word-level mining to acquire word translations, which are then used to perform sentence-level mining. Because the mined parallel pairs may contain noise or mistakes, we introduce a filtering criterion to select the best in-context examples from a pool of unsupervised parallel sentences. We evaluate our approach using two multilingual LLMs on 288 directions from the FLORES-200 dataset and analyze the impact of various linguistic features on performance. Our findings demonstrate the effectiveness of our unsupervised approach in mining in-context examples for MT: it yields translation performance better than or comparable to translation with regular in-context samples (extracted from human-annotated data), while also outperforming other state-of-the-art UMT methods by an average of 7 BLEU points.
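
The selection step described above can be illustrated with a small sketch. The exact filtering criterion is not spelled out in this abstract, so the code below shows only a generic Okapi BM25 ranking over a pool of mined sentence pairs, not the authors' implementation. The function names (`bm25_scores`, `select_examples`, `build_prompt`), the whitespace tokenization, the parameters `k1` and `b`, the prompt layout, and the toy English-French pairs are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(corpus_tokens)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


def select_examples(test_source, mined_pairs, k=2):
    """Pick the k mined (source, target) pairs whose source side best matches the input."""
    # Hypothetical tokenization: lowercase and split on whitespace.
    corpus = [src.lower().split() for src, _ in mined_pairs]
    scores = bm25_scores(test_source.lower().split(), corpus)
    ranked = sorted(range(len(mined_pairs)), key=lambda i: scores[i], reverse=True)
    return [mined_pairs[i] for i in ranked[:k]]


def build_prompt(test_source, examples, src_lang="English", tgt_lang="French"):
    """Format the selected pairs as in-context examples (hypothetical prompt layout)."""
    lines = [f"{src_lang}: {src}\n{tgt_lang}: {tgt}\n" for src, tgt in examples]
    lines.append(f"{src_lang}: {test_source}\n{tgt_lang}:")
    return "\n".join(lines)


# Toy pool standing in for unsupervised mined pairs (invented for illustration).
mined_pairs = [
    ("the cat sleeps on the sofa", "le chat dort sur le canapé"),
    ("stock markets fell sharply today", "les marchés boursiers ont fortement chuté aujourd'hui"),
    ("a cat chased the mouse", "un chat a poursuivi la souris"),
]
examples = select_examples("the cat is hungry", mined_pairs, k=2)
print(build_prompt("the cat is hungry", examples))
```

Ranking on the source side keeps the selection independent of target-side noise in the mined pairs; a quality-based pre-filter, if used, would run before this step.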
