Testing Low-Resource Language Support in LLMs Using Language Proficiency Exams: the Case of Luxembourgish

Cedric Lothritz, Jordi Cabot, Laura Bernardy

TL;DR

This paper addresses the challenge of evaluating LLM support for Luxembourgish, a low-resource language, by combining CEFR-aligned language proficiency exams with two LuxGen natural language generation (NLG) tasks in a zero-shot setting across 53 LLMs. It systematically compares model proficiency, analyzes error patterns by linguistic category, and investigates whether exam performance predicts NLG outcomes, using both traditional metrics and a novel LLM-as-a-Judge framework. The main findings are that large models consistently outperform smaller ones, that grammar remains the hardest category, and that exam performance correlates with generation quality on the LuxGen tasks, albeit with caveats related to task alignment and potential data leakage. The work contributes a public leaderboard, a methodology transferable to other low-resource languages, and practical guidance for choosing LLMs in Luxembourgish contexts, with implications for privacy-conscious deployments and multilingual evaluation.

Abstract

Large Language Models (LLMs) have become an increasingly important tool in research and society at large. While LLMs are regularly used all over the world by experts and laypeople alike, they are predominantly developed with English-speaking users in mind: they perform well in English and other widespread languages, while less-resourced languages such as Luxembourgish are treated as a lower priority. This lack of attention is also reflected in the sparsity of available evaluation tools and datasets. In this study, we investigate the viability of language proficiency exams as evaluation tools for the Luxembourgish language. We find that large models such as Claude and DeepSeek-R1 typically achieve high scores, while smaller models perform poorly. We also find that performance on such language exams can be used to predict performance on other NLP tasks in Luxembourgish.
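The claim that exam performance predicts performance on other tasks amounts to a correlation between two per-model scores. The sketch below shows one way such a check could be run, using Spearman rank correlation. It is an illustration only, not the authors' procedure: the choice of Spearman, the function names, and the five score pairs are all assumptions, and the numbers are placeholders rather than results from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Placeholder values (NOT from the paper): one entry per hypothetical model.
exam_accuracy = [0.35, 0.52, 0.61, 0.78, 0.90]   # share of exam questions answered correctly
generation_score = [1.8, 2.4, 2.2, 3.9, 4.5]     # e.g. a judge rating on a generation task

rho = spearman(exam_accuracy, generation_score)
print(f"Spearman rho = {rho:.2f}")
```

A rank-based coefficient is used here because exam accuracy and generation scores live on different scales, so only the ordering of models is compared. A value close to 1 would indicate that models ranking higher on the exams also tend to rank higher on the generation task.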

Paper Structure

This paper contains 33 sections, 5 figures, 14 tables.

Figures (5)

  • Figure 1: General testing pipeline.
  • Figure 2: Performance on language exams vs performance on the Headline Generation task.
  • Figure 3: Comparison between LLM-as-a-Judge metrics, BERTScore, and METEOR on the headline generation task. A stacked line chart is used to better highlight the correlations between the metrics.
  • Figure 4: Comparison between LLM-as-a-Judge metrics, BERTScore, and METEOR on the short description task. A stacked line chart is used to better highlight the correlations between the metrics.
  • Figure 5: Performance on Language Exams vs Performance on the Short Description Task.