Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers

Benjamin Marie, Atsushi Fujita, Raphael Rubino

TL;DR

The paper conducts the first large-scale meta-evaluation of MT evaluation practices across 769 papers (2010–2020) and reveals pervasive BLEU-centric comparisons, scarce statistical testing, and data/preprocessing inconsistencies that undermine credibility. It documents four main pitfalls and proposes a concise MT evaluation guideline plus a 4-point meta-evaluation score to assess credibility. The work highlights the risk of drawing strong conclusions from non-identical data or copied results and demonstrates the need for standardized reporting tools like SacreBLEU. By providing actionable guidelines, it aims to improve reproducibility and the scientific integrity of MT research.

Abstract

This paper presents the first large-scale meta-evaluation of machine translation (MT). We annotated MT evaluations conducted in 769 research papers published from 2010 to 2020. Our study shows that practices for automatic MT evaluation have dramatically changed during the past decade and follow concerning trends. An increasing number of MT evaluations rely exclusively on differences between BLEU scores to draw conclusions, without performing any kind of statistical significance testing or human evaluation, while at least 108 metrics claiming to be better than BLEU have been proposed. MT evaluations in recent papers tend to copy and compare automatic metric scores from previous work to claim the superiority of a method or an algorithm, without confirming either that exactly the same training, validation, and test data have been used or that the metric scores are comparable. Furthermore, tools for reporting standardized metric scores are still far from being widely adopted by the MT community. After showing how the accumulation of these pitfalls leads to dubious evaluation, we propose a guideline to encourage better automatic MT evaluation along with a simple meta-evaluation scoring method to assess its credibility.
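
The meta-evaluation score mentioned above can be pictured as one point per avoided pitfall. The sketch below is an illustrative reading of the abstract and TL;DR, not the authors' scoring code: the class, the field names, and the exact wording of each criterion are assumptions, chosen to mirror the four pitfalls described (BLEU-only comparison, no significance testing, copied scores that are not guaranteed comparable, and non-identical data).

```python
from dataclasses import dataclass


@dataclass
class EvaluationAnnotation:
    """Hypothetical annotation of one paper's MT evaluation.

    All field names are assumptions made for this sketch.
    """
    beyond_bleu_or_human_eval: bool   # a metric other than BLEU, or a human evaluation, is used
    significance_tested: bool         # statistical significance testing is performed
    copied_scores: bool               # scores are copied from previous work
    copied_scores_comparable: bool    # copied scores come from a standardized tool (e.g. SacreBLEU)
    identical_data: bool              # compared systems use the same train/validation/test data


def meta_evaluation_score(a: EvaluationAnnotation) -> int:
    """Return a score from 0 to 4: one point for each pitfall the evaluation avoids."""
    criteria = [
        a.beyond_bleu_or_human_eval,
        a.significance_tested,
        # Copying is only a problem when comparability is not guaranteed.
        (not a.copied_scores) or a.copied_scores_comparable,
        a.identical_data,
    ]
    return sum(1 for satisfied in criteria if satisfied)


if __name__ == "__main__":
    # A paper that compares copied BLEU scores only, with no testing, on different data.
    weak = EvaluationAnnotation(False, False, True, False, False)
    print(meta_evaluation_score(weak))
```

Under these assumptions, a paper that avoids all four pitfalls scores 4, and each pitfall it falls into costs one point.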

Paper Structure

This paper contains 11 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Percentage of papers using each evaluation metric per year. Metrics displayed are used in more than five papers. "Other" denotes all other automatic metrics. "Human" denotes that a human evaluation has been conducted.
  • Figure 2: Percentage of papers testing statistical significance of differences between metric scores.
  • Figure 3: Percentage of papers copying scores from previous work ("Copied scores"), using SacreBLEU ("SacreBLEU"), and copying scores without using SacreBLEU ("Copied w/o SacreBLEU").
  • Figure 4: Percentage of papers that compared MT systems using data that are not identical.
  • Figure 5: Percentage of papers affected by the accumulation of pitfalls. Each bar considers only the papers counted by the previous bar, e.g., the last bar considers only papers that compared MT systems exploiting different datasets while exclusively using BLEU ("BLEU only"), without performing statistical significance testing ("w/o sigtest"), to measure differences with BLEU scores copied from other papers ("w/ copied scores") while not using SacreBLEU ("w/o SacreBLEU").
  • ...and 1 more figure
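
The cascading counts described in the Figure 5 caption, where each bar considers only the papers counted by the previous bar, can be sketched as a chain of filters. This is a minimal illustration on hypothetical annotation records: the dictionary keys, the order of the steps, and the choice to express each bar as a percentage of all papers are assumptions, not the authors' annotation schema or plotting code.

```python
from typing import Callable, Dict, List, Tuple

Paper = Dict[str, bool]  # hypothetical per-paper annotation flags

# Each step narrows the set kept by the previous step (labels follow the caption).
STEPS: List[Tuple[str, Callable[[Paper], bool]]] = [
    ("BLEU only", lambda p: p["bleu_only"]),
    ("w/o sigtest", lambda p: not p["sigtest"]),
    ("w/ copied scores", lambda p: p["copied_scores"]),
    ("w/o SacreBLEU", lambda p: not p["sacrebleu"]),
    ("different data", lambda p: p["different_data"]),
]


def cascade(papers: List[Paper]) -> List[Tuple[str, float]]:
    """Return (label, percentage of all papers) for each cumulative filter."""
    if not papers:
        raise ValueError("at least one annotated paper is required")
    remaining = list(papers)
    bars = []
    for label, condition in STEPS:
        # Keep only papers that also satisfy this step's condition.
        remaining = [p for p in remaining if condition(p)]
        bars.append((label, 100.0 * len(remaining) / len(papers)))
    return bars


if __name__ == "__main__":
    # Toy records, invented for illustration only.
    toy = [
        {"bleu_only": True, "sigtest": False, "copied_scores": True,
         "sacrebleu": False, "different_data": True},
        {"bleu_only": False, "sigtest": True, "copied_scores": False,
         "sacrebleu": True, "different_data": False},
    ]
    for label, pct in cascade(toy):
        print(f"{label}: {pct:.1f}%")
```

Because every step filters the output of the previous one, the percentages can only stay equal or decrease from one bar to the next.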