Do Large Language Models Grasp The Grammar? Evidence from Grammar-Book-Guided Probing in Luxembourgish
Lujun Li, Yewei Song, Lama Sleem, Yiqun Wang, Yangjie Xu, Cedric Lothritz, Niccolo Gentile, Radu State, Tegawende F. Bissyande, Jacques Klein
TL;DR
This paper proposes a Grammar-Book-Guided evaluation pipeline to probe the grammatical competence of large language models in Luxembourgish, a low-resource language. By grounding the evaluation in an actual grammar book and a structured four-component pipeline (Material Inspector, Phrasing Atelier, Twin Forge, Proof Stand), the authors move beyond surface translation metrics to assess explicit grammatical understanding. Across four tasks derived from grammar points, they observe only a weak overall correlation between translation quality and grammatical competence: larger models improve translation more than grammar, and reasoning ability plays a key role in grammatical understanding. The work shows that LLMs acquire some grammar knowledge through extensive pre-training, but that deep, rule-governed grammatical mastery remains challenging and depends on model capacity, reasoning, and targeted training. It also establishes a framework and dataset for systematic grammar evaluation in low-resource languages, enabling future cross-linguistic benchmarking and method refinement.
Abstract
Grammar refers to the system of rules that governs the structural organization of, and the semantic relations among, linguistic units such as sentences, phrases, and words within a given language. In natural language processing, grammar-focused evaluation protocols remain scarce, a gap that is even more pronounced for low-resource languages. Moreover, the extent to which large language models genuinely comprehend grammatical structure, especially the mapping between syntactic structures and meanings, remains under debate. To investigate this issue, we propose a Grammar-Book-Guided evaluation pipeline, a four-stage framework intended to make grammar evaluation systematic and generalizable, and we take Luxembourgish as a case study. The results show a weak positive correlation between translation performance and grammatical understanding, indicating that strong translations do not necessarily imply deep grammatical competence. Larger models perform well overall due to their semantic strength but remain weak in morphology and syntax, struggling particularly with Minimal Pair tasks, while strong reasoning ability offers a promising way to enhance their grammatical understanding.
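As an illustration of what a "weak positive correlation" between translation performance and grammatical understanding means in practice, the following minimal sketch computes a Pearson coefficient over per-model scores. The abstract does not state which correlation coefficient the authors used, so Pearson is an assumption here; the function name `pearson`, the variable names, and all score values are hypothetical placeholders invented for illustration and are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Placeholder per-model scores, made up purely for illustration
# (NOT taken from the paper): one entry per hypothetical model.
translation_scores = [0.42, 0.55, 0.61, 0.48, 0.70]
grammar_accuracy = [0.58, 0.52, 0.66, 0.61, 0.63]

r = pearson(translation_scores, grammar_accuracy)
print(f"Pearson r = {r:.2f}")
```

A coefficient close to 0 would indicate that translation quality says little about grammatical competence, while a value near 1 would indicate that the two move together across models.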
