LuxMT Technical Report

Nils Rehlinger

TL;DR

LuxMT is a Luxembourgish→French and Luxembourgish→English translation system obtained by fine-tuning Gemma 3 27B on parallel data drawn from LuxAlign news articles and LB parliamentary transcripts augmented with Google Translate. The training data is filtered with LuxEmbedder, a Luxembourgish sentence-embedding model, to remove low-equivalence segment pairs. A benchmark built from Luci, a human-translated tourist magazine, is used to avoid training-data contamination and covers LB→FR, LB→EN, and LB→DE. The results suggest substantial gains over the Gemma 3 baseline, including on LB→DE even though the training data contains no DE, and the study also investigates LuxEmbedder as a potential quality-estimation metric, while acknowledging limitations related to benchmark scope and metric validity. The work contributes a practical LB MT pipeline, a reproducible benchmark framework, and insights on data filtering and QE signals for low-resource languages, with plans to extend to more data and languages in the future.

Abstract

We introduce LuxMT, a machine translation system based on Gemma 3 27B and fine-tuned for translation from Luxembourgish (LB) into French (FR) and English (EN). To assess translation performance, we construct a novel benchmark covering LB-FR, LB-EN, and LB-DE using human-translated data from Luci, a tourist magazine about Luxembourg. Training data stems from LuxAlign, a parallel corpus of multilingual Luxembourgish news articles, and LB parliamentary transcripts augmented with Google Translate. We filter the data using LuxEmbedder (LB sentence embeddings) to remove low-equivalence segment pairs. Overall, LuxMT's results suggest strong improvements over the Gemma 3 baseline, even for translating LB to German (DE), despite the training data not containing any DE. We also explore LuxEmbedder's potential to be used as a quality estimation metric and find strong correlations with other reference-based metrics. However, we call for further research to fully assess the metric's utility and advise using it with caution.
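
The filtering step described above can be illustrated with a minimal sketch. It assumes that equivalence is scored as the cosine similarity between the source and target sentence embeddings and that pairs below a fixed threshold are dropped; the abstract specifies neither the similarity measure nor the threshold. The `embed` callable, the toy vectors, and the value `0.8` are hypothetical stand-ins for illustration and are not taken from the LuxMT pipeline or the LuxEmbedder API.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_pairs(pairs, embed, threshold):
    """Keep (source, target) pairs whose embedding similarity reaches the threshold.

    `embed` is a hypothetical callable mapping a sentence to a vector; in practice
    it would wrap a sentence-embedding model such as LuxEmbedder.
    """
    kept = []
    for src, tgt in pairs:
        if cosine_similarity(embed(src), embed(tgt)) >= threshold:
            kept.append((src, tgt))
    return kept


# Toy embeddings, invented purely for this illustration.
TOY_VECTORS = {
    "Moien": [1.0, 0.0],
    "Bonjour": [0.9, 0.1],
    "Merci": [0.0, 1.0],
}

pairs = [("Moien", "Bonjour"), ("Moien", "Merci")]
kept = filter_pairs(pairs, TOY_VECTORS.__getitem__, threshold=0.8)
print(kept)
```

In this toy run the mismatched pair ("Moien", "Merci") falls below the threshold and is discarded, while the equivalent pair is kept. A real threshold would need to be tuned on held-out data rather than fixed in advance.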

Paper Structure (24 sections, 1 figure, 4 tables)

Figures (1)

  • Figure 1: Data mixture for MT fine-tuning. LB→FR total: 32k; LB→EN total: 22.5k; Total: 54.5k.