Multilingual European Language Models: Benchmarking Approaches and Challenges

Fabio Barth, Georg Rehm

TL;DR

This paper tackles the inadequacy of English-centric benchmarks for evaluating multilingual European LLMs. It analyzes seven multilingual benchmarks, classifies development approaches into translated English benchmarks and native multilingual datasets, and identifies four central challenges: cross-lingual comparison, translationese, cultural bias, and data quality. The authors discuss remedies including human-in-the-loop verification and iterative translation ranking, and advocate for culturally aware, rigorously validated benchmarks. The work highlights the practical importance of designing equitable benchmarks to accurately assess multilingual reasoning and QA across European languages.

Abstract

The breakthrough of generative large language models (LLMs) that can solve different tasks through chat interaction has led to a significant increase in the use of general benchmarks to assess the quality or performance of these models beyond individual applications. There is also a need for better methods to evaluate and compare models, given the ever-increasing number of newly published models. However, most of the established benchmarks revolve around the English language. This paper analyses the benefits and limitations of current evaluation datasets, focusing on multilingual European benchmarks. We analyse seven multilingual benchmarks and identify four major challenges. Furthermore, we discuss potential solutions to enhance translation quality and mitigate cultural biases, including human-in-the-loop verification and iterative translation ranking. Our analysis highlights the need for culturally aware and rigorously validated benchmarks to accurately assess the reasoning and question-answering capabilities of multilingual LLMs.

Paper Structure

This paper contains 10 sections.