Evaluating the Translation Performance of Large Language Models Based on Euas-20

Yan Huang; Wei Liu

TL;DR

The paper introduces Euas-20, a 20-language dataset for evaluating large language models on translation tasks. Nine contemporary LLMs are evaluated with zero-shot prompts and scored with BLEU and COMET. The results show that translation performance improves rapidly with larger, more diverse pretraining data, but also that performance remains imbalanced across languages and that translation illusions are prevalent, especially for non-English languages. The study finds that multilingual, high-quality corpora substantially boost translation ability, while models still struggle with low-resource languages and unregistered words, which underscores the need for broader multilingual data and improved evaluation probes. Together, these findings offer practical guidance for researchers and developers on dataset design, model training priorities, and evaluation strategies for reliable cross-lingual translation with LLMs.

Abstract

In recent years, with the rapid development of deep learning, large language models (LLMs) such as BERT and GPT have achieved breakthrough results in natural language processing tasks. Machine translation (MT), one of the core tasks of natural language processing, has also benefited from large language models and made a qualitative leap. Despite this significant progress in translation performance, machine translation still faces many challenges. In this paper, we therefore construct the Euas-20 dataset to help researchers and developers evaluate the performance of large language models on translation tasks, their translation ability across different languages, and the effect of pre-training data on that ability.
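
The summary above names BLEU as one of the two automatic metrics used to score translations. As a rough illustration of what that score measures, the sketch below computes a simplified sentence-level BLEU from clipped n-gram precisions and a brevity penalty. It is not the paper's evaluation code: the function names `ngram_counts` and `sentence_bleu` are hypothetical, whitespace tokenization and the absence of smoothing are assumptions made for brevity, and a real evaluation would normally use a corpus-level implementation with proper tokenization (whitespace splitting does not work for Chinese, for example).

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Simplified single-reference BLEU on whitespace tokens, in [0, 1]."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped matches: each hypothesis n-gram counts at most as often
        # as it appears in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # assumption: no smoothing in this sketch
        log_precisions.append(math.log(matches / total))
    # Brevity penalty lowers the score of hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)
```

Calling `sentence_bleu("the cat sat on a mat", "the cat sat on the mat")` gives a value strictly between 0 and 1, since one substituted word breaks several higher-order n-grams. COMET, the other metric mentioned, is a learned neural metric and cannot be reproduced in a few lines like this.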

Paper Structure (18 sections, 5 figures, 3 tables)

Introduction
Background
Large Language Models
Machine Translation
Experimental Setup
Dataset
LLMs
Evaluation Methods
Evaluation Indicators
Testing of machine translation for LLMs
Continuous Improvement of Translation Ability of LLMs
Translation performance of LLMs across languages
Effect of corpus on the translation performance of LLMs
Illusions in the translation of LLMs
Translation words that LLMs tend to choose in translation tasks
...and 3 more sections

Figures (5)

Figure 1: Prompt 1
Figure 2: Prompt 2
Figure 3: BLEU and COMET scores for translations by nine LLMs, centred on English and Chinese.
Figure 4: Translation performance (BLEU) of LLMs on our evaluated languages. ‘xx-en’ and ‘xx-zh’ denote translation from other languages to English and Chinese, respectively.
Figure 5: Corpus share of LLMs
