Evaluating the Translation Performance of Large Language Models Based on Euas-20

Yan Huang, Wei Liu

TL;DR

The paper introduces Euas-20, a 20-language dataset for evaluating large language models on translation tasks. Nine contemporary LLMs are tested with zero-shot prompts and scored with BLEU and COMET. The results show translation performance improving rapidly as pretraining data grows larger and more diverse, alongside persistent imbalances across languages and frequent translation hallucinations, especially for non-English languages. Multilingual, high-quality corpora substantially boost translation ability, but models still struggle with low-resource languages and unregistered (out-of-vocabulary) words, which points to a need for broader multilingual data and better evaluation probes. The work offers researchers and developers practical guidance on dataset design, model training priorities, and evaluation strategies for reliable cross-lingual translation with LLMs.

Abstract

In recent years, with the rapid development of deep learning technology, large language models (LLMs) such as BERT and GPT have achieved breakthrough results in natural language processing tasks. Machine translation (MT), as one of the core tasks of natural language processing, has also benefited from the development of large language models and achieved a qualitative leap. Despite the significant progress in translation performance achieved by large language models, machine translation still faces many challenges. Therefore, in this paper, we construct the dataset Euas-20 so that researchers and developers can evaluate the performance of large language models on translation tasks, their translation ability across different languages, and the effect of pre-training data on that ability.

Paper Structure

This paper contains 18 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Prompt 1
  • Figure 2: Prompt 2
  • Figure 3: BLEU and COMET scores for translations by nine LLMs, centred on English and Chinese.
  • Figure 4: Translation performance (BLEU) of LLMs on our evaluated languages. ‘xx-en’ and ‘xx-zh’ denote translation from other languages to English and Chinese, respectively.
  • Figure 5: Corpus share of LLMs