QE4PE: Word-level Quality Estimation for Human Post-Editing
Gabriele Sarti, Vilém Zouhar, Grzegorz Chrupała, Ana Guerberof-Arenas, Malvina Nissim, Arianna Bisazza
TL;DR
QE4PE evaluates word-level quality estimation in a realistic post-editing workflow with 42 professional translators across English→Italian and English→Dutch. The study compares four highlight modalities (No Highlight, Oracle, Supervised XCOMET-XXL, and Unsupervised uncertainty-based highlights), using the GroTE interface to log editing behavior and measure productivity, editing effort, and quality improvements on biomedical and social media texts. Findings show that domain, language, and editor speed strongly shape highlight effectiveness. Although some automatic and human-made highlights show measurable gains in precision and recall, perceived usability remains low, and higher accuracy does not translate directly into productivity gains. The work exposes a gap between QE accuracy and practical utility, suggesting that future work should prioritize usability, the effect of highlights on editorial decisions, and possible combinations with edit guidance to better support human translators.
Abstract
Word-level quality estimation (QE) methods aim to detect erroneous spans in machine translations, which can direct and facilitate human post-editing. While the accuracy of word-level QE systems has been assessed extensively, their usability and downstream influence on the speed, quality and editing choices of human post-editing remain understudied. In this study, we investigate the impact of word-level QE on machine translation (MT) post-editing in a realistic setting involving 42 professional post-editors across two translation directions. We compare four error-span highlight modalities, including supervised and uncertainty-based word-level QE methods, for identifying potential errors in the outputs of a state-of-the-art neural MT model. Post-editing effort and productivity are estimated from behavioral logs, while quality improvements are assessed by word- and segment-level human annotation. We find that domain, language and editors' speed are critical factors in determining highlights' effectiveness, with modest differences between human-made and automated QE highlights underlining a gap between accuracy and usability in professional workflows.
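To make the notion of highlight accuracy concrete, the following is a minimal sketch of how agreement between highlighted tokens and post-edited tokens could be scored. It is not the evaluation code used in the study: the function name `highlight_precision_recall` and the token-level boolean masks are assumptions made for illustration only.

```python
from typing import Sequence, Tuple


def highlight_precision_recall(
    highlighted: Sequence[bool], edited: Sequence[bool]
) -> Tuple[float, float]:
    """Score highlights against edits for one MT segment.

    Hypothetical helper: both inputs are boolean masks with one entry
    per MT token. `highlighted[i]` is True if token i was highlighted,
    `edited[i]` is True if the post-editor changed token i.
    """
    if len(highlighted) != len(edited):
        raise ValueError("masks must cover the same tokens")

    # Tokens that were both highlighted and edited.
    true_positives = sum(h and e for h, e in zip(highlighted, edited))
    n_highlighted = sum(highlighted)
    n_edited = sum(edited)

    # Convention chosen here: return 0.0 when a denominator is empty.
    precision = true_positives / n_highlighted if n_highlighted else 0.0
    recall = true_positives / n_edited if n_edited else 0.0
    return precision, recall


# Toy example with five tokens: two highlighted, two edited, one overlap.
highlighted = [False, True, True, False, False]
edited = [False, True, False, False, True]
precision, recall = highlight_precision_recall(highlighted, edited)
print(precision, recall)
```

In this toy segment, one of the two highlighted tokens was edited and one of the two edited tokens was highlighted, so both scores are 0.5. High scores on such a measure say nothing by themselves about editing speed, which is the accuracy versus usability gap the abstract describes.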
