Who Benchmarks the Benchmarks? A Case Study of LLM Evaluation in Icelandic

Finnur Ágúst Ingimundarson; Steinunn Rut Friðriksdóttir; Bjarki Ármannsson; Iris Edda Nowenstein; Steinþór Steingrímsson

Abstract

This paper evaluates current Large Language Model (LLM) benchmarking for Icelandic, identifies problems, and calls for improved evaluation methods, particularly in low/medium-resource languages. We show that benchmarks that include entirely unverified synthetic or machine-translated data commonly contain severely flawed test examples that are likely to skew the results and undermine the tests' validity. We warn against using such methods without verification in low/medium-resource settings, as the translation quality can, at best, only be as good as the MT quality for a given language at any given time. Indeed, our quantitative error analysis of existing benchmarks for Icelandic shows clear differences between human-authored or human-translated benchmarks and synthetic or machine-translated ones.

Paper Structure (19 sections, 2 figures, 2 tables)

Introduction
Related Work
Leaderboards for Icelandic
Miðeind's leaderboard
EuroEval's leaderboard
Quantitative Error Analysis
Results
Discussion
Machine-translated benchmarks
No native speaker involvement
MT without native speaker involvement
Typos, errors and whether they matter
Beyond single-label output, towards more diverse benchmarks
Conclusion
Limitations
...and 4 more sections

Figures (2)

Figure 1: Mean proportion and 95% confidence interval for each of the labels across annotations and benchmarks, arranged by OK proportion in descending order
Figure 2: Label proportions per rater across the evaluated benchmarks
