
Language verY Rare for All

Ibrahim Merad, Amos Wolf, Ziad Mazzawi, Yannick Léo

TL;DR

This work tackles the task of machine translation for extremely low-resource languages by introducing LYRA, a single-GPU training framework that fuses open LLM fine-tuning, retrieval-augmented generation, and transfer learning from related high-resource languages. Focusing on French–Monégasque, LYRA constructs a parallel dataset from dictionaries, grammar texts, and literature and demonstrates that data standardization, RAG, and cross-language transfer collectively yield competitive translations with limited data. The study shows that LYRA can surpass or closely match state-of-the-art encoder–decoder models in rare language translation while remaining accessible on modest hardware. These results highlight a practical pathway for expanding multilingual translation to underrepresented languages and suggest directions for further gains through data augmentation and larger LLM fine-tuning.

Abstract

In the quest to overcome language barriers, encoder-decoder models such as NLLB have extended machine translation to rare languages, and some of them (e.g., NLLB 1.3B) can even be trained on a single GPU. While general-purpose LLMs perform well in translation, open LLMs prove highly competitive when fine-tuned for specific tasks involving unknown corpora. We introduce LYRA (Language verY Rare for All), a novel approach that combines open LLM fine-tuning, retrieval-augmented generation (RAG), and transfer learning from related high-resource languages. To ease adoption, we restrict the study to single-GPU training. We consider two-way translation between French and Monégasque, a rare language unsupported by existing translation tools because of limited corpus availability. Our results demonstrate LYRA's effectiveness: it consistently matches, and frequently surpasses, state-of-the-art encoder-decoder models in rare language translation.
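To make the retrieval-augmented step more concrete, here is a minimal sketch of how dictionary entries might be retrieved and placed in a translation prompt. It is an illustration under assumptions, not the authors' pipeline: the abstract does not specify the retriever or the prompt format, so the token-overlap scoring, the function names (`tokenize`, `retrieve`, `build_prompt`), and the prompt layout are all hypothetical, and the target-side strings are placeholders rather than real Monégasque.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on word characters (Unicode-aware in Python 3)."""
    return re.findall(r"\w+", text.lower())

def retrieve(query, entries, k=2):
    """Return the k entries whose source side shares the most tokens with the query.

    Token overlap is a simple stand-in for whatever retriever a real system uses.
    """
    query_counts = Counter(tokenize(query))
    scored = []
    for src, tgt in entries:
        overlap = sum((query_counts & Counter(tokenize(src))).values())
        if overlap > 0:
            scored.append((overlap, src, tgt))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(src, tgt) for _, src, tgt in scored[:k]]

def build_prompt(query, entries, k=2):
    """Assemble a prompt that shows retrieved entries before the sentence to translate."""
    lines = ["Translate from French to Monégasque.", "Related dictionary entries:"]
    lines += [f"- {src} => {tgt}" for src, tgt in retrieve(query, entries, k)]
    lines += [f"French: {query}", "Monégasque:"]
    return "\n".join(lines)

# Hypothetical toy dictionary; target strings are placeholders, not real translations.
TOY_ENTRIES = [
    ("la maison", "<placeholder-1>"),
    ("le chat", "<placeholder-2>"),
    ("la mer", "<placeholder-3>"),
]

prompt = build_prompt("Le chat dort dans la maison", TOY_ENTRIES)
print(prompt)
```

In a setup like this, the assembled prompt would be passed to the fine-tuned LLM, so the model sees relevant bilingual entries at inference time instead of relying only on what it absorbed during fine-tuning.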

Paper Structure

This paper contains 17 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Illustration of our method for building LYRA.
  • Figure 2: Comparison of models' translation performance in both directions in terms of BLEU scores before and after data standardization. The latter uniformly improves translation performance across all models.
  • Figure 3: Evolution of translation performance in both directions for the considered models through training epochs as measured by the BLEU score. The training of certain models was stopped early due to overfitting.