Danoliteracy of Generative Large Language Models
Søren Vejlgaard Holm, Lars Kai Hansen, Martin Carsten Nielsen
TL;DR
This paper presents the Danoliterate Benchmark, an open-source, language-specific evaluation framework that assesses Generative Large Language Models (GLLMs) on Danish across eight scenarios spanning real-world knowledge, natural language understanding, and natural language generation. It shows that a small, curated Danish benchmark can yield robust model rankings that correlate strongly with human judgments ($\rho\sim0.8$). Closed-weight, large, instruct-tuned models (e.g., GPT-4, Claude Opus) outperform open-weight models. A single dominant Danoliteracy factor explains $95\%$ of cross-scenario performance variance, a $g$-factor-like consistency across tasks that supports the viability of language-focused benchmarks for low-resource settings. An open-source framework and live leaderboard enable ongoing evaluation, and the paper discusses ethical considerations and limitations to guide responsible deployment in Danish and other low-resource languages.
Abstract
The language technology moonshot moment of Generative Large Language Models (GLLMs) was not limited to English: These models brought a surge of technological applications, investments, and hype to low-resource languages as well. However, the capabilities of these models in languages such as Danish were, until recently, difficult to verify beyond qualitative demonstrations due to a lack of applicable evaluation corpora. We present a GLLM benchmark to evaluate \emph{Danoliteracy}, a measure of Danish language and cultural competency, across eight diverse scenarios such as Danish citizenship tests and abstractive social media question answering. This limited-size benchmark was found to produce a robust ranking that correlates with human feedback at $\rho\sim 0.8$, with GPT-4 and Claude Opus models achieving the highest rankings. Analyzing these model results across scenarios, we find one strong underlying factor explaining $95\%$ of scenario performance variance for GLLMs in Danish, suggesting a $g$ factor of model consistency in language adaptation.
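To make the two quantities in the abstract concrete, the following is a minimal sketch of how a rank correlation and a "share of variance explained by one factor" could be computed from a models-by-scenarios score matrix. It is not the authors' analysis code: it uses a plain principal component decomposition as a stand-in for the factor analysis, the function names `first_component_share` and `spearman_rho` are hypothetical, and the score matrix is synthetic data generated only for illustration.

```python
import numpy as np


def first_component_share(scores: np.ndarray) -> float:
    """Fraction of total variance captured by the first principal component.

    scores: array of shape (n_models, n_scenarios).
    """
    # Standardize each scenario column so no scenario dominates by scale.
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    # Squared singular values are proportional to the component variances.
    singular_values = np.linalg.svd(z, compute_uv=False)
    variances = singular_values ** 2
    return float(variances[0] / variances.sum())


def spearman_rho(a, b) -> float:
    """Spearman rank correlation (this simple version assumes no ties)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])


# Synthetic illustration only (not benchmark data): 20 hypothetical models,
# 8 scenarios, one latent ability per model plus a small amount of noise.
rng = np.random.default_rng(0)
ability = rng.normal(size=20)
loadings = rng.uniform(0.8, 1.2, size=8)
scores = np.outer(ability, loadings) + 0.1 * rng.normal(size=(20, 8))

share = first_component_share(scores)
rho = spearman_rho(scores.mean(axis=1), ability)
print(f"first-component share: {share:.2f}, rank correlation: {rho:.2f}")
```

When a single latent ability drives every scenario score, as in this synthetic setup, the first component absorbs nearly all of the variance; this is the pattern the abstract describes as a $g$ factor.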
