Hindsight Quality Prediction Experiments in Multi-Candidate Human-Post-Edited Machine Translation

Malik Marmonier, Benoît Sagot, Rachel Bawden

TL;DR

This paper investigates two complementary paradigms for predicting machine translation (MT) quality: source-side difficulty prediction and candidate-side quality estimation (QE). Its findings indicate that the architectural shift towards LLMs alters the reliability of established quality prediction methods while mitigating earlier challenges in document-level translation.

Abstract

This paper investigates two complementary paradigms for predicting machine translation (MT) quality: source-side difficulty prediction and candidate-side quality estimation (QE). The rapid adoption of Large Language Models (LLMs) into MT workflows is reshaping the research landscape, yet its impact on established quality prediction paradigms remains underexplored. We study this issue through a series of "hindsight" experiments on a unique, multi-candidate dataset resulting from a genuine MT post-editing (MTPE) project. The dataset consists of over 6,000 English source segments with nine translation hypotheses from a diverse set of traditional neural MT systems and advanced LLMs, all evaluated against a single, final human post-edited reference. Using Kendall's rank correlation, we assess the predictive power of source-side difficulty metrics, candidate-side QE models and position heuristics against two gold-standard scores: TER (as a proxy for post-editing effort) and COMET (as a proxy for human judgment). Our findings highlight that the architectural shift towards LLMs alters the reliability of established quality prediction methods while simultaneously mitigating previous challenges in document-level translation.
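
To illustrate the kind of comparison the abstract describes, the following is a minimal sketch of Kendall's rank correlation between a predictor and a gold-standard score. It assumes the tau-b variant (which corrects for ties); the abstract does not say which variant the authors use. The function name `kendall_tau_b` and the toy scores are hypothetical and are not taken from the paper's data.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists.

    Assumption: tau-b (tie-corrected) is used here for illustration;
    the paper's exact variant is not stated in the abstract.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both factors
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1

    # Pairs not tied in x, times pairs not tied in y.
    not_tied_x = concordant + discordant + tied_y_only
    not_tied_y = concordant + discordant + tied_x_only
    denom = sqrt(not_tied_x * not_tied_y)
    if denom == 0:
        return float("nan")  # undefined when one variable is constant
    return (concordant - discordant) / denom


# Hypothetical toy values: one QE score and one reference TER per segment.
qe_scores = [0.82, 0.55, 0.71, 0.30, 0.64]   # higher = predicted better
ter_scores = [0.10, 0.42, 0.25, 0.60, 0.20]  # lower = less post-editing

tau = kendall_tau_b(qe_scores, ter_scores)
print(f"Kendall tau-b (QE vs. TER): {tau:.2f}")
```

Because TER decreases as quality improves while COMET increases, a useful quality predictor would be expected to correlate negatively with TER and positively with COMET; the sign convention should be checked before comparing values across the two gold standards.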

Paper Structure (22 sections, 8 equations, 12 figures, 3 tables)

Figures (12)

  • Figure 1: Post-editing interface used to compile the French partition of the OLDI Seed Corpus. Image borrowed from Marmonier, Sagot, and Bawden (WMT 2025).
  • Figure 2: Kendall's $\tau$ correlation between source-side metrics and translation quality as measured by TER. An asterisk ($^*$) indicates a statistically significant correlation ($p < 0.05$).
  • Figure 3: Kendall's $\tau$ correlation between source-side metrics and translation quality as measured by COMET. An asterisk ($^*$) indicates a statistically significant correlation ($p < 0.05$).
  • Figure 4: Kendall's $\tau$ correlation between two reference-free QE metrics (COMET_QE, MetricX_QE) and our gold-standard reference scores. The top plot shows correlations against reference TER, while the bottom plot uses reference COMET. An asterisk ($^*$) indicates a statistically significant correlation ($p < 0.05$).
  • Figure 5: Kendall's $\tau$ correlation between QE metrics and gold-standard scores, averaged across system groups. The top row shows correlations against reference TER; the bottom row uses reference COMET. The left column groups by system type (LLM vs. NMT), while the right column groups by translation granularity.
  • ...and 7 more figures