Neural Text Normalization for Luxembourgish using Real-Life Variation Data
Anne-Marie Lutgen, Alistair Plum, Christoph Purschke, Barbara Plank
TL;DR
This work tackles Luxembourgish orthographic variation by introducing the first generative text normalization models trained on real-life variation data. It constructs a large parallel training set from Spellchecker.lu variants and Chamber of Deputies transcripts, then fine-tunes ByT5 and mT5 and benchmarks them against GPT-4o, Llama, and Spellux. The evaluation combines quantitative metrics (error reduction rate (ERR), character error rate (CER), accuracy, recall, precision, F1) with a linguistically motivated qualitative test suite covering 21 rules. ByT5 base is the strongest of the fine-tuned sequence models; GPT-4o performs strongly in some settings but raises reproducibility concerns, and Spellux provides a competitive baseline. The study shows that real-life variation data enables effective Luxembourgish normalization and highlights the strengths and trade-offs of byte-based, word-based, and pipeline approaches, with potential applications in linguistic analysis and standardization efforts.
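As a rough illustration of two of the quantitative metrics named above, the sketch below computes CER as Levenshtein edit distance divided by reference length, and ERR as the accuracy gain over a baseline relative to the baseline's remaining error. Both formulas are the definitions commonly used in normalization work and are assumed here; the paper's exact formulation may differ, and the function names `levenshtein`, `cer`, and `err` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(prediction: str, reference: str) -> float:
    """Character error rate: edit distance normalized by reference length (assumed definition)."""
    if not reference:
        return 0.0 if not prediction else 1.0
    return levenshtein(prediction, reference) / len(reference)


def err(acc_system: float, acc_baseline: float) -> float:
    """Error reduction rate relative to a baseline accuracy (assumed definition).

    The baseline is typically the accuracy obtained by leaving the input unchanged.
    """
    if acc_baseline >= 1.0:
        raise ValueError("baseline accuracy must be below 1.0")
    return (acc_system - acc_baseline) / (1.0 - acc_baseline)
```

Under these assumed definitions, an ERR of 0 means the system is no better than the baseline, 1 means it removes all remaining errors, and negative values mean it introduces more errors than it fixes.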
Abstract
Orthographic variation is very common in Luxembourgish texts due to the absence of a fully-fledged standard variety. Developing NLP tools for Luxembourgish is further complicated by the lack of annotated and parallel data, a problem exacerbated by ongoing standardization. In this paper, we propose the first sequence-to-sequence normalization models for Luxembourgish, based on the ByT5 and mT5 architectures and trained on data derived from word-level real-life variation. We perform a fine-grained, linguistically motivated evaluation to test byte-based, word-based and pipeline-based models for their strengths and weaknesses in text normalization. We show that our sequence model using real-life variation data is an effective approach for tailor-made normalization in Luxembourgish.
