Language verY Rare for All
Ibrahim Merad, Amos Wolf, Ziad Mazzawi, Yannick Léo
TL;DR
This work tackles machine translation for extremely low-resource languages by introducing LYRA, a single-GPU training framework that combines open LLM fine-tuning, retrieval-augmented generation (RAG), and transfer learning from related high-resource languages. Focusing on French–Monégasque, LYRA constructs a parallel dataset from dictionaries, grammar texts, and literature, and demonstrates that data standardization, RAG, and cross-language transfer together yield competitive translations from limited data. The study shows that LYRA can surpass or closely match state-of-the-art encoder-decoder models in rare language translation while remaining accessible on modest hardware. These results highlight a practical pathway for extending multilingual translation to underrepresented languages and suggest directions for further gains through data augmentation and larger LLM fine-tuning.
Abstract
In the quest to overcome language barriers, encoder-decoder models like NLLB have expanded machine translation to rare languages, with some models (e.g., NLLB 1.3B) even trainable on a single GPU. While general-purpose LLMs perform well in translation, open LLMs prove highly competitive when fine-tuned for specific tasks involving unknown corpora. We introduce LYRA (Language verY Rare for All), a novel approach that combines open LLM fine-tuning, retrieval-augmented generation (RAG), and transfer learning from related high-resource languages. To ease adoption, we restrict ourselves to single-GPU training. We focus on two-way translation between French and Monégasque, a rare language unsupported by existing translation tools due to limited corpus availability. Our results demonstrate LYRA's effectiveness: it consistently matches, and frequently surpasses, state-of-the-art encoder-decoder models in rare language translation.
