Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite

Klaudia Thellmann, Bernhard Stadler, Michael Färber

Abstract

Machine-translated benchmark datasets reduce costs and offer scale, but noise, loss of structure, and uneven quality weaken confidence. What matters is not merely whether we can translate, but also whether we can measure and verify translation reliability at scale. We study translation quality in the EU20 benchmark suite, which comprises five established benchmarks translated into 20 languages, via a three-step automated quality assurance approach: (i) a structural corpus audit with targeted fixes; (ii) quality profiling using a neural metric (COMET, reference-free and reference-based) with translation service comparisons (DeepL / ChatGPT / Google); and (iii) an LLM-based span-level translation error landscape. Trends are consistent: datasets with lower COMET scores exhibit a higher share of accuracy/mistranslation errors at span level (notably HellaSwag; ARC is comparatively clean). Reference-based COMET on MMLU against human-edited samples points in the same direction. We release cleaned/corrected versions of the EU20 datasets and code for reproducibility. In sum, automated quality assurance offers practical, scalable indicators that help prioritize review -- complementing, not replacing, human gold standards.
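
The quality profiling in step (ii) relies on xCOMET-XXL in both reference-free and reference-based modes. A minimal sketch of how such scoring can be run with the open-source unbabel-comet package is given below; the checkpoint identifier and the example segments are illustrative assumptions, not the paper's exact configuration.

    # Sketch: segment-level xCOMET scoring (reference-free and reference-based).
    # Assumes the `unbabel-comet` package and access to the Unbabel/XCOMET-XXL
    # checkpoint on Hugging Face; any loadable COMET checkpoint works the same way.
    from comet import download_model, load_from_checkpoint

    model = load_from_checkpoint(download_model("Unbabel/XCOMET-XXL"))

    # Reference-free ("src" + "mt"): usable for all EU20 items.
    qe_batch = [
        {"src": "The mitochondrion is the powerhouse of the cell.",
         "mt":  "Das Mitochondrium ist das Kraftwerk der Zelle."},
    ]

    # Reference-based ("src" + "mt" + "ref"): only where human-edited references
    # exist (e.g. MMLU items overlapping with Global-MMLU).
    ref_batch = [
        {"src": "The mitochondrion is the powerhouse of the cell.",
         "mt":  "Das Mitochondrium ist das Kraftwerk der Zelle.",
         "ref": "Das Mitochondrium ist das Kraftwerk der Zelle."},
    ]

    for batch in (qe_batch, ref_batch):
        out = model.predict(batch, batch_size=8, gpus=1)
        print(out.scores)        # per-segment scores in [0, 1]
        print(out.system_score)  # corpus-level average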

Paper Structure

This paper contains 25 sections, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: EU20 reference-free quality landscape. Left: median xCOMET-XXL per language$\times$dataset on a unified $[0,1]$ scale; short in-cell tick encodes IQR ($Q_3{-}Q_1$). Middle: median target-side sentence length (words). Right: Spearman correlation ($\rho$) between length and score (negative $\rho$ indicates lower scores for longer outputs). Rows are aligned across panels and sorted by the language-wise median across datasets. (The per-cell statistics are sketched in code after this list.)
  • Figure 2: EU20 vs. Okapi xCOMET-XXL reference-free quality comparison per language$\times$dataset. Cells report the median difference $\Delta=\mathrm{median}(EU20)-\mathrm{median}(Okapi)$ on the paired overlap and the win-rate (% items where EU20 $>$ Okapi). Positive $\Delta$ favors EU20.
  • Figure 3: Critical-difference (CD) diagram on MMLU (ref-free). Points are systems' average ranks across five languages (lower is better). Thin bars show Nemenyi intervals (avg$\pm$CD/2; $\alpha{=}0.05$, $k{=}3$, $N{=}5$). A grey bridge links systems that are not significantly different (no bridge = significant). (The CD computation is sketched in code after this list.)
  • Figure 4: $\Delta_{\text{ref}}$ (EU20$-$Okapi) of reference-based xCOMET-XXL on MMLU (reference = Global-MMLU). Bars show $\Delta_{\text{ref}}$ per language with 95% paired bootstrap CIs. Zero line indicates parity. Sorted by $\Delta_{\text{ref}}$. Common items only.
  • Figure 5: EU20 error overview per $\text{language}\times\text{dataset}$. Each cell shows four horizontal bars: A+M, F, O, and Clean. Error rates per 1,000 items.
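
The per-cell statistics behind Figures 1 and 2 are standard summary and paired comparisons. A minimal sketch, assuming per-item xCOMET scores and target-side word counts are already available as aligned arrays, could look as follows (function names and aggregation choices are illustrative, not taken from the paper's code):

    import numpy as np
    from scipy.stats import spearmanr

    def cell_summary(scores, lengths):
        """One language x dataset cell (Figure 1): median score, IQR (Q3 - Q1),
        and Spearman rho between target-side length in words and the score."""
        scores = np.asarray(scores, dtype=float)
        q1, med, q3 = np.percentile(scores, [25, 50, 75])
        rho, p_value = spearmanr(lengths, scores)
        return {"median": med, "iqr": q3 - q1, "rho": rho, "rho_p": p_value}

    def paired_eu20_vs_okapi(eu20, okapi):
        """EU20 vs. Okapi on the paired overlap (Figure 2): difference of
        medians and win-rate, i.e. % of items where EU20 scores higher."""
        a, b = np.asarray(eu20, dtype=float), np.asarray(okapi, dtype=float)
        assert a.shape == b.shape, "arrays must be aligned on common items"
        return {"delta": np.median(a) - np.median(b),
                "win_rate": 100.0 * np.mean(a > b)}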
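
Figures 3 and 4 additionally report uncertainty estimates: Nemenyi intervals around average ranks and paired bootstrap confidence intervals for $\Delta_{\text{ref}}$. A sketch under the caption's parameters ($k{=}3$ systems, $N{=}5$ languages, $\alpha{=}0.05$) follows; the choice to aggregate per-item differences by the mean is an assumption here, not the paper's stated procedure.

    import numpy as np

    def nemenyi_cd(k=3, n=5, q_alpha=2.343):
        """Critical difference for the Nemenyi post-hoc test (Figure 3).
        q_alpha = 2.343 is the tabulated critical value for k = 3 systems at
        alpha = 0.05 (Demsar, 2006); intervals are avg rank +/- CD/2."""
        return q_alpha * np.sqrt(k * (k + 1) / (6.0 * n))

    def paired_bootstrap_ci(eu20, okapi, n_boot=10_000, alpha=0.05, seed=0):
        """Paired bootstrap CI for Delta_ref on one language (Figure 4),
        resampling items with replacement and recomputing the mean difference."""
        rng = np.random.default_rng(seed)
        diffs = np.asarray(eu20, dtype=float) - np.asarray(okapi, dtype=float)
        idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
        boot_means = diffs[idx].mean(axis=1)
        lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
        return diffs.mean(), (lo, hi)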