DaLA: Danish Linguistic Acceptability Evaluation Guided by Real World Errors

Gianluca Barmina, Nathalie Carmen Hau Norman, Peter Schneider-Kamp, Lukas Galke Poech

TL;DR

DaLA introduces a Danish linguistic acceptability benchmark generated by 14 corruption types derived from real-world Danish errors, applied to UD Danish sentences. It combines automatic and manual validation to ensure corruption quality and demonstrates that DaLA poses a tougher, more discriminative challenge for open-source and open-weight LLMs than the prior ScaLA benchmark. The approach yields robust, expandable datasets (training/validation/test with 3,328 samples, extendable to 7,656) and shows improved ability to differentiate model performance. This provides a more realistic, language-specific evaluation framework with potential applicability to other languages.

Abstract

We present an enhanced benchmark for evaluating linguistic acceptability in Danish. We first analyze the most common errors found in written Danish. Based on this analysis, we introduce a set of fourteen corruption functions that generate incorrect sentences by systematically introducing errors into existing correct Danish sentences. To ensure the accuracy of these corruptions, we assess their validity using both manual and automatic methods. The resulting sentences are then used as a benchmark for evaluating Large Language Models on a linguistic acceptability judgement task. Our findings demonstrate that this extension is both broader and more comprehensive than the current state of the art. By incorporating a greater variety of corruption types, our benchmark provides a more rigorous assessment of linguistic acceptability and increases task difficulty, as evidenced by the lower performance of LLMs on our benchmark compared to existing ones. Our results also suggest that our benchmark has higher discriminatory power, which makes it easier to distinguish well-performing models from low-performing ones.
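
To make the idea of a corruption function concrete, here is a minimal sketch in Python. It is purely illustrative and is not taken from the paper: the function name `drop_present_tense_r`, the token-list input, and the choice of error (dropping the final "-r" of a present-tense verb, a familiar slip in written Danish) are all assumptions, and the fourteen functions in DaLA may be defined quite differently. The sketch returns `None` when the corruption cannot be applied, so that only corruptible sentences yield a corrupted counterpart.

```python
from typing import List, Optional


def drop_present_tense_r(tokens: List[str], verb_index: int) -> Optional[List[str]]:
    """Hypothetical corruption: remove the final "r" of a present-tense verb.

    `verb_index` is assumed to come from an external annotation (for example
    a part-of-speech tag); this sketch does not detect verbs itself.
    Returns None when the token at `verb_index` cannot be corrupted this way.
    """
    verb = tokens[verb_index]
    if len(verb) < 3 or not verb.endswith("r"):
        return None  # sentence is not corruptible by this function
    corrupted = list(tokens)  # copy so the correct sentence stays intact
    corrupted[verb_index] = verb[:-1]
    return corrupted


correct = ["Jeg", "kører", "til", "arbejde"]
incorrect = drop_present_tense_r(correct, verb_index=1)
print(" ".join(incorrect))  # Jeg køre til arbejde
```

Pairing each correct sentence with its corrupted version gives one acceptable and one unacceptable example for a binary acceptability task.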

Paper Structure

This paper contains 27 sections, 2 equations, 2 figures, 5 tables, 1 algorithm.

Figures (2)

  • Figure 1: Left: Overview of DaLA creation method, including automatic and human corruption-quality validation, as well as LLM evaluation. Right: Comparison of LLM performance on ScaLA and DaLA, measured with the Matthews Correlation Coefficient (higher is better). LLMs perform worse on DaLA, indicating increased task difficulty.
  • Figure 2: Proportion of corruptible examples among all Universal Dependencies samples (blue circles) vs proportion of actually corrupted examples in training set (red crosses).
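
Figure 1 reports scores as the Matthews Correlation Coefficient (MCC). For readers unfamiliar with the metric, the sketch below computes it for a binary task from the confusion-matrix counts using the standard formula. The label encoding (1 for acceptable, 0 for unacceptable) and the function name are assumptions made for illustration, and this is not the paper's evaluation code. Returning 0.0 when the denominator is zero is a common convention, adopted here as an assumption.

```python
from math import sqrt
from typing import Sequence


def matthews_corrcoef(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """MCC for binary labels, assuming 1 = acceptable and 0 = unacceptable."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


# One acceptable sentence is misjudged as unacceptable.
print(round(matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0]), 3))  # 0.577
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level predictions. A model that always answers "acceptable" scores 0.0 here, which is why MCC is more informative than accuracy when comparing models on a benchmark like this.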