Danoliteracy of Generative Large Language Models

Søren Vejlgaard Holm, Lars Kai Hansen, Martin Carsten Nielsen

TL;DR

This paper presents the Danoliterate Benchmark, an open-source, language-specific evaluation framework designed to assess Generative Large Language Models (GLLMs) on Danish across eight scenarios spanning real-world knowledge, natural language understanding, and natural language generation. It demonstrates that a small, curated Danish benchmark can yield robust model rankings that correlate strongly with human judgments ($\rho\sim0.8$) and reveal a single dominant Danoliteracy factor explaining most cross-scenario variance. The study finds that closed-weight, large, instruct-tuned models (e.g., GPT-4, Claude Opus) outperform open-weight models, and identifies a g-factor-like consistency across tasks, supporting the viability of language-focused benchmarks for low-resource settings. An open-source framework and live leaderboard enable ongoing evaluation, while ethical considerations and limitations are discussed to guide responsible deployment in Danish and other low-resource languages.

Abstract

The language technology moonshot moment of Generative Large Language Models (GLLMs) was not limited to English: these models brought a surge of technological applications, investments, and hype to low-resource languages as well. However, the capabilities of these models in languages such as Danish were, until recently, difficult to verify beyond qualitative demonstrations due to a lack of applicable evaluation corpora. We present a GLLM benchmark to evaluate Danoliteracy, a measure of Danish language and cultural competency across eight diverse scenarios such as Danish citizenship tests and abstractive social media question answering. This limited-size benchmark was found to produce a robust ranking that correlates with human feedback at $\rho\sim 0.8$, with GPT-4 and Claude Opus models achieving the highest rankings. Analyzing these model results across scenarios, we find one strong underlying factor explaining $95\%$ of scenario performance variance for GLLMs in Danish, suggesting a $g$ factor of model consistency in language adaptation.
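
The two headline statistics in the abstract can be illustrated with a minimal sketch. It assumes that $\rho$ denotes a Spearman rank correlation and that the single-factor share is read off the first principal component of a models-by-scenarios score matrix; the abstract specifies neither, so both are assumptions rather than the authors' procedure. The function names and all numbers in the demo are hypothetical.

```python
import math


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def first_factor_share(scores, iterations=500):
    """Share of total variance captured by the first principal component.

    scores: rows are models, columns are scenarios (assumed layout).
    Uses power iteration on the sample covariance matrix; it assumes the
    data are not constant and the start vector is not orthogonal to the
    leading eigenvector.
    """
    n, d = len(scores), len(scores[0])
    means = [sum(row[j] for row in scores) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in scores]
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    total = sum(cov[j][j] for j in range(d))
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    top = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    return top / total


if __name__ == "__main__":
    # Hypothetical scores for five models; not taken from the paper.
    benchmark_index = [0.91, 0.88, 0.70, 0.55, 0.30]
    human_rating = [0.80, 0.85, 0.60, 0.62, 0.20]
    print("rank correlation:", spearman(benchmark_index, human_rating))

    # Hypothetical models-by-scenarios matrix (three scenarios shown).
    matrix = [
        [0.90, 0.85, 0.88],
        [0.80, 0.82, 0.79],
        [0.60, 0.55, 0.65],
        [0.40, 0.45, 0.38],
        [0.20, 0.30, 0.25],
    ]
    print("first-factor variance share:", first_factor_share(matrix))
```

A share close to 1 means that models which do well in one scenario tend to do well in all of them, which is the sense in which the abstract speaks of a $g$ factor.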

Paper Structure

This paper contains 40 sections, 2 equations, 13 figures, 9 tables.

Figures (13)

  • Figure 1: The overall evaluation setup: A collection of GLLMs, including closed-source (lock symbol), instruct-tuned (bullseye), and multilingual (globe) ones, was evaluated in Danish across diverse use-case scenarios.
  • Figure 2: Normalized results for selected models across the eight scenarios, divided into three categories as described in the datasets section. Claude Opus is overtaken by GPT-4 on the NER task but wins on an NLG task. Llama 3 70B, the SOTA open-weights model, lags behind on NLU and knowledge-based tasks. A Danish-specialized model with only 1.1B parameters, DanskGPT-tiny Chat, benchmarks well in NLG but fails on knowledge and understanding.
  • Figure 3: The non-normalized metric scores across evaluation scenarios for two models that were judged highly according to human feedback. Uncertainties are 95% confidence intervals according to the bootstrapping procedure, and the micro-average is displayed for each model.
  • Figure 4: The general prompting approach translated to English.
  • Figure 5: Model Danoliteracy Index across all scenarios for top performers. Two model nodes are connected iff the bootstrapping procedure could not reveal a significant difference in benchmark performance at $\alpha=0.05$. Together with the special o1 model, Claude Opus and the GPT-4 family models are consistent winners.
  • ...and 8 more figures
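
The captions for Figures 3 and 5 refer to a bootstrapping procedure for 95% confidence intervals and for pairwise significance at $\alpha=0.05$. The details of that procedure are not given in this section, so the following is only a sketch of one common variant: a percentile bootstrap over per-example scores, with a paired version for comparing two models on the same examples. The function names, the resample count, and the per-example score lists are assumptions made for illustration.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample examples with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo_idx = max(0, round((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


def significantly_different(scores_a, scores_b, n_resamples=2000,
                            alpha=0.05, seed=0):
    """Paired bootstrap test on per-example score differences.

    Returns True when the (1 - alpha) interval of the mean difference
    excludes zero. Both lists must score the same examples in the same order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired comparison needs equally long score lists")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    lo, hi = bootstrap_ci(diffs, n_resamples=n_resamples, alpha=alpha, seed=seed)
    return lo > 0 or hi < 0


if __name__ == "__main__":
    # Hypothetical per-example correctness (1 = correct) for two models.
    data_rng = random.Random(1)
    model_a = [1 if data_rng.random() < 0.80 else 0 for _ in range(200)]
    model_b = [1 if data_rng.random() < 0.75 else 0 for _ in range(200)]
    print("model A 95% CI:", bootstrap_ci(model_a))
    print("A vs. B significantly different:",
          significantly_different(model_a, model_b))
```

Under this reading, two models would be drawn as connected in a figure such as Figure 5 exactly when `significantly_different` returns `False` for the pair.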