Charting the European LLM Benchmarking Landscape: A New Taxonomy and a Set of Best Practices

Špela Vintar, Taja Kuzman Pungeršek, Mojca Brglez, Nikola Ljubešić

TL;DR

The paper addresses the need for robust evaluation of LLMs in non-English European languages and the biases of English-centric benchmarks. It introduces a new taxonomy for benchmarking that integrates language capabilities, multilinguality, speech and culture competence, and proposes a European benchmarking registry with rich provenance and metadata. It reviews current benchmarks (major, multilingual, dynamic, culture-specific) and outlines best practices for provenance, accessibility, language coverage, and evaluation metrics. The work aims to guide more culturally aware, language-sensitive, and transparent evaluation methods that can reduce Western-centric biases and support non-English LLM development.

Abstract

While new benchmarks for large language models (LLMs) are being developed continuously to keep pace with the growing capabilities of new models and AI in general, using and evaluating LLMs in non-English languages remains a little-charted landscape. We give a concise overview of recent developments in LLM benchmarking, and then propose a new taxonomy for the categorization of benchmarks that is tailored to multilingual or non-English use scenarios. We further propose a set of best practices and quality standards that could lead to a more coordinated development of benchmarks for European languages. Among other recommendations, we advocate for greater sensitivity to language and culture in evaluation methods.

Paper Structure

This paper contains 20 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Top-level categories with subcategories.