QE4PE: Word-level Quality Estimation for Human Post-Editing

Gabriele Sarti, Vilém Zouhar, Grzegorz Chrupała, Ana Guerberof-Arenas, Malvina Nissim, Arianna Bisazza

TL;DR

QE4PE evaluates word-level quality estimation in a realistic post-editing workflow with 42 professional translators across English→Italian and English→Dutch. The study compares four highlight modalities (No Highlight, Oracle, Supervised XCOMET-XXL, Unsupervised) in the GroTE interface, measuring editing productivity, edits, and quality improvements across the biomedical and social media domains. Findings show that domain, language, and editor speed strongly shape highlight effectiveness. While some automatic and human highlights offer measurable precision/recall gains, perceived usability remains low, and improvements in accuracy do not straightforwardly translate into productivity gains. The work highlights a gap between QE accuracy and practical utility, suggesting that future work should prioritize usability, impact on editorial decisions, and potential combinations with edit guidance to support human translators.

Abstract

Word-level quality estimation (QE) methods aim to detect erroneous spans in machine translations, which can direct and facilitate human post-editing. While the accuracy of word-level QE systems has been assessed extensively, their usability and downstream influence on the speed, quality and editing choices of human post-editing remain understudied. In this study, we investigate the impact of word-level QE on machine translation (MT) post-editing in a realistic setting involving 42 professional post-editors across two translation directions. We compare four error-span highlight modalities, including supervised and uncertainty-based word-level QE methods, for identifying potential errors in the outputs of a state-of-the-art neural MT model. Post-editing effort and productivity are estimated from behavioral logs, while quality improvements are assessed by word- and segment-level human annotation. We find that domain, language and editors' speed are critical factors in determining highlights' effectiveness, with modest differences between human-made and automated QE highlights underlining a gap between accuracy and usability in professional workflows.

Paper Structure

This paper contains 39 sections, 13 figures, 15 tables.

Figures (13)

  • Figure 1: A summary of the QE4PE study. Documents are translated by a neural MT model and reviewed by professional editors across two translation directions and four highlight modalities. Editing effort, productivity and usability across modalities are estimated from editing logs and questionnaires. Finally, the quality of MT and edited outputs is assessed with MQM/ESA human annotations and automatic metrics.
  • Figure 2: An example of the QE4PE GroTE setup for two segments in an English→Italian document.
  • Figure 3: Productivity of post-editors across QE4PE stages (Pre, Main, Post). The ➔ marks outstanding entries and ✕ marks missing data. Each row corresponds to the same three translators across all stages.
  • Figure 4: Median quality improvement for post-edited segments at various initial MT quality levels across domains and highlight modalities. Quality scores are estimated using XCOMET segment-level QE (top) and professional ESA annotations (bottom). Histograms show example counts across quality bins for the two metrics. Dotted lines show upper bounds for quality improvements given starting MT quality.
  • Figure 5: Top: QA interface with cropped examples of biomedical and social media texts with error annotations (Biomedical: post-edited segments with No Highlight; Social media: MT outputs). Bottom: Annotation instructions for our MQM-inspired error taxonomy.
  • ...and 8 more figures