GaelEval: Benchmarking LLM Performance for Scottish Gaelic

Peter Devine, William Lamb, Beatrice Alex, Ignatius Ezeani, Dawn Knight, Mícheál J. Ó Meachair, Paul Rayson, Martin Wynne

Abstract

Multilingual large language models (LLMs) often exhibit emergent 'shadow' capabilities in languages without official support, yet their performance on these languages remains uneven and under-measured. This is particularly acute for morphosyntactically rich minority languages such as Scottish Gaelic, where translation benchmarks fail to capture structural competence. We introduce GaelEval, the first multi-dimensional benchmark for Gaelic, comprising: (i) an expert-authored morphosyntactic MCQA task; (ii) a culturally grounded translation benchmark and (iii) a large-scale cultural knowledge Q&A task. Evaluating 19 LLMs against a fluent-speaker human baseline ($n=30$), we find that Gemini 3 Pro Preview achieves $83.3\%$ accuracy on the linguistic task, surpassing the human baseline ($78.1\%$). Proprietary models consistently outperform open-weight systems, and in-language (Gaelic) prompting yields a small but stable advantage (+$2.4\%$). On the cultural task, leading models exceed $90\%$ accuracy, though most systems perform worse under Gaelic prompting and absolute scores are inflated relative to the manual benchmark. Overall, GaelEval reveals that frontier models achieve above-human performance on several dimensions of Gaelic grammar, demonstrates the effect of Gaelic prompting and shows a consistent performance gap favouring proprietary over open-weight models.

Paper Structure

This paper contains 26 sections, 1 figure, 7 tables.

Figures (1)

  • Figure 1: Linguistic competence accuracy by grammatical category for the human baseline and the top nine best-performing models (0--1 scale; darker = higher accuracy). ADJ = adjectives; ADV = adverbials; CFE = clefts and focussing expressions; COL = colours; CONJ = conjunctions and particles; DET = determiners; FORM = formulaic expressions; IPAS = impersonals and passives; NOM = nominal morphology; NUM = numerals; PREP = prepositions; PRO = pronouns and anaphor resolution; QUES = questions and tags; REL = relative clauses; TAM = Tense-Aspect-Modality system; VNC = verbal noun cores. Means are computed across categories and so differ from those in Table \ref{tab:manualqa_accuracy}. Results shown use the English prompting condition.