
Low-Resource Machine Translation through Retrieval-Augmented LLM Prompting: A Study on the Mambai Language

Raphaël Merx, Aso Mahmudi, Katrina Langford, Leo Alberto de Araujo, Ekaterina Vylomova

TL;DR

This work tackles English-to-Mambai translation in a very low-resource setting by leveraging retrieval-augmented prompting with a bilingual dictionary and a parallel-sentence corpus derived from the Mambai Language Manual. The study compares open-source and proprietary LLMs (LlaMa 2 70b, Mixtral 8x7B, GPT-4) and finds that including dictionary entries and a blend of sentences retrieved through TF-IDF and semantic embeddings can substantially boost translation quality. BLEU reaches up to 23.5 on a manual-derived test set but only 4.4 on a native-speaker test set, underscoring domain- and data-diversity issues for low-resource MT. A key contribution is the release of an initial Mambai corpus, including bilingual dictionaries in both directions and a 1,187-sentence parallel corpus. The gap between the two test sets highlights the variability of MT performance across domains and the need for representative evaluation data. The results illustrate both the promise and the limitations of retrieval-augmented prompting for very low-resource languages and provide practical guidelines for constructing prompts that surface domain-relevant content.

Abstract

This study explores the use of large language models (LLMs) for translating English into Mambai, a low-resource Austronesian language spoken in Timor-Leste, with approximately 200,000 native speakers. Leveraging a novel corpus derived from a Mambai language manual and additional sentences translated by a native speaker, we examine the efficacy of few-shot LLM prompting for machine translation (MT) in this low-resource context. Our methodology involves the strategic selection of parallel sentences and dictionary entries for prompting, aiming to enhance translation accuracy, using open-source and proprietary LLMs (LlaMa 2 70b, Mixtral 8x7B, GPT-4). We find that including dictionary entries in prompts and a mix of sentences retrieved through TF-IDF and semantic embeddings significantly improves translation quality. However, our findings reveal stark disparities in translation performance across test sets, with BLEU scores reaching as high as 21.2 on materials from the language manual, in contrast to a maximum of 4.4 on a test set provided by a native speaker. These results underscore the importance of diverse and representative corpora in assessing MT for low-resource languages. Our research provides insights into few-shot LLM prompting for low-resource MT, and makes available an initial corpus for the Mambai language.
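
To make the prompting setup described above more concrete, the following is a minimal sketch of how TF-IDF sentence retrieval and dictionary lookup might feed a few-shot prompt. It is not the authors' implementation: the function names, the prompt wording, and the toy data are assumptions made for illustration, the Mambai strings are placeholders rather than real translations, and the semantic-embedding retriever is omitted because it would require an embedding model.

```python
import math
import re
from collections import Counter

# Toy data: the Mambai sides are placeholders, NOT real Mambai text.
PAIRS = [
    ("Where is the market?", "<MAMBAI_SENTENCE_1>"),
    ("I am going to the market.", "<MAMBAI_SENTENCE_2>"),
    ("The child is sleeping.", "<MAMBAI_SENTENCE_3>"),
]
DICTIONARY = {"market": "<MAMBAI_WORD_A>", "child": "<MAMBAI_WORD_B>"}


def tokenize(text):
    """Lowercase and split into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) with smoothed IDF for tokenized docs."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}
    vecs = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vecs, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_tfidf(query, pairs, k):
    """Return the k sentence pairs whose English side is most similar to query."""
    vecs, idf = tfidf_vectors([tokenize(src) for src, _ in pairs])
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(query)).items()}
    ranked = sorted(range(len(pairs)), key=lambda i: cosine(q, vecs[i]), reverse=True)
    return [pairs[i] for i in ranked[:k]]


def lookup_dictionary(query, dictionary):
    """Return (English, Mambai) entries for query words found in the dictionary."""
    return [(w, dictionary[w]) for w in dict.fromkeys(tokenize(query)) if w in dictionary]


def build_prompt(query, examples, entries):
    """Assemble a few-shot prompt; this wording is an assumed format."""
    lines = ["Translate from English to Mambai.", ""]
    if entries:
        lines.append("Dictionary entries:")
        lines += [f"- {en}: {mb}" for en, mb in entries]
        lines.append("")
    for en, mb in examples:
        lines += [f"English: {en}", f"Mambai: {mb}", ""]
    lines += [f"English: {query}", "Mambai:"]
    return "\n".join(lines)


query = "Is the market open?"
prompt = build_prompt(
    query,
    retrieve_tfidf(query, PAIRS, k=2),
    lookup_dictionary(query, DICTIONARY),
)
print(prompt)
```

In a fuller version, a second retriever based on sentence embeddings would contribute additional examples, and the resulting prompt would be sent to the chosen LLM. Note that without stop-word filtering, frequent function words can influence the ranking, which is one reason a mix of retrieval methods may help.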

Paper Structure (17 sections, 3 figures, 3 tables)

This paper contains 17 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of our process for extracting dictionaries and a parallel corpus from the Mambai Language Manual.
  • Figure 2: Mambai configuration in ABBYY FineReader 15.
  • Figure 3: Overview of our process for translating English sentences to Mambai using both dictionary entries and sentence pairs in few-shot LLM prompting.