LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data
Julian Valline, Cedric Lothritz, Jordi Cabot
TL;DR
The paper addresses the scarcity of instruction-tuning data for Luxembourgish by introducing LuxIT, a monolingual dataset synthesized from RTL and Wikipedia using a Luxembourgish-proficient model and quality-controlled with an LLM-as-a-judge. The dataset comprises 59,242 high-quality instruction-answer pairs structured for conversational fine-tuning and is generated through a five-step pipeline from monolingual seed data. It presents a replicable methodology that avoids English-centric translation, contributing to Luxembourgish NLP and informing resource creation for other low-resource languages. Fine-tuning experiments on five small LLMs yield mixed improvements across language exams, highlighting the need for larger-scale data and further research to maximize benefits in low-resource settings.
Abstract
The effectiveness of instruction-tuned Large Language Models (LLMs) is often limited in low-resource linguistic settings due to a lack of high-quality training data. We introduce LuxIT, a novel, monolingual instruction tuning dataset for Luxembourgish developed to mitigate this challenge. We synthesize the dataset from a corpus of native Luxembourgish texts, utilizing DeepSeek-R1-0528, chosen for its shown proficiency in Luxembourgish. Following generation, we apply a quality assurance process, employing an LLM-as-a-judge approach. To investigate the practical utility of the dataset, we fine-tune several smaller-scale LLMs on LuxIT. Subsequent benchmarking against their base models on Luxembourgish language proficiency examinations, however, yields mixed results, with performance varying significantly across different models. LuxIT represents a critical contribution to Luxembourgish natural language processing and offers a replicable monolingual methodology, though our findings highlight the need for further research to optimize its application.
