Text Generation Models for Luxembourgish with Limited Data: A Balanced Multilingual Strategy
Alistair Plum, Tharindu Ranasinghe, Christoph Purschke
TL;DR
The paper tackles data scarcity in Luxembourgish NLP by introducing a balanced multilingual pre-training strategy that combines Luxembourgish, German, and French data. It presents two T5-based models, LuxT5 and LuxT5-Grande, along with LuxGen, the first text-generation benchmark for Luxembourgish, which comprises four generative tasks. Empirical results show that LuxT5-Grande consistently outperforms monolingual and some multilingual baselines on LuxGen, while manual evaluation highlights the limitations of BLEU for Luxembourgish and the value of data-balanced multilingual pre-training. The findings suggest that leveraging linguistically related neighbor languages can substantially boost generation for low-resource languages, with implications for other minority languages and dialects, and they motivate further ablation studies.
Abstract
This paper addresses the challenges of developing language models for less-represented languages, with a focus on Luxembourgish. Despite its active development, Luxembourgish faces a scarcity of digital data, exacerbated by Luxembourg's multilingual context. We propose a novel text generation model based on the T5 architecture, combining limited Luxembourgish data with equal amounts, in terms of size and type, of German and French data. We hypothesise that training on Luxembourgish, German, and French will improve the model's cross-lingual transfer learning capabilities and allow it to outperform monolingual and large multilingual models. To verify this, we explore whether multilingual or monolingual training is more beneficial for Luxembourgish language generation. For the evaluation, we introduce LuxGen, a text generation benchmark that is the first of its kind for Luxembourgish.
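As a rough illustration of what "equal amounts, in terms of size and type" could mean in practice, the sketch below downsamples German and French documents so that each text type holds no more documents than the Luxembourgish data does. The function name `balance_corpora`, the language codes, the text-type labels, and the choice to match by document count are all assumptions made for illustration; the abstract does not specify the actual balancing procedure.

```python
import random


def balance_corpora(corpora, anchor="lb", seed=0):
    """Downsample every language to the anchor language's size per text type.

    corpora: {language: {text_type: [documents]}}
    Returns the same structure, keeping only text types present in the anchor.
    Hypothetical helper: matching by document count is an assumption.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = {lang: {} for lang in corpora}
    for text_type, anchor_docs in corpora[anchor].items():
        target = len(anchor_docs)
        for lang, by_type in corpora.items():
            docs = by_type.get(text_type, [])
            # Never request more documents than the language actually has.
            k = min(target, len(docs))
            balanced[lang][text_type] = rng.sample(docs, k)
    return balanced


# Toy data: Luxembourgish ("lb") is the scarce language.
corpora = {
    "lb": {"news": ["lb_n1", "lb_n2"], "wiki": ["lb_w1"]},
    "de": {"news": [f"de_n{i}" for i in range(10)],
           "wiki": [f"de_w{i}" for i in range(5)]},
    "fr": {"news": [f"fr_n{i}" for i in range(8)],
           "wiki": [f"fr_w{i}" for i in range(3)]},
}

balanced = balance_corpora(corpora)
sizes = {lang: {t: len(d) for t, d in by_type.items()}
         for lang, by_type in balanced.items()}
print(sizes)
```

In this toy setup every language ends up with two news documents and one wiki document. A real pipeline might balance by token or byte count instead of document count.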
