MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators

Qingyu Lu, Liang Ding, Kanjian Zhang, Jinxia Zhang, Dacheng Tao

TL;DR

MQM-APE tackles the misalignment between LLM-based error annotations and human MQM judgments in MT evaluation. It introduces a training-free pipeline of three modules (an Error Analysis Evaluator, an Automatic Post-Editor, and a Pairwise Quality Verifier) that filters predicted errors by post-editing the translation for each error and verifying whether quality improves, yielding more reliable and interpretable error spans. Across eight LLMs and both high- and low-resource languages (WMT22 and IndicMT), MQM-APE consistently improves over GEMBA-MQM at the system and segment levels and complements translation-specific evaluators such as Tower. The work also analyzes cost, error distributions, and LLM selection, offering practical guidance for deploying interpretable LLM-based MT evaluators in real-world settings. Overall, MQM-APE provides a reusable, model-agnostic method to refine error annotations in LLM-driven MT quality assessment.

Abstract

Large Language Models (LLMs) have shown significant potential as judges for Machine Translation (MT) quality assessment, providing both scores and fine-grained feedback. Although approaches such as GEMBA-MQM have shown state-of-the-art performance on reference-free evaluation, the predicted errors do not align well with those annotated by humans, limiting their interpretability as feedback signals. To enhance the quality of error annotations predicted by LLM evaluators, we introduce a universal and training-free framework, $\textbf{MQM-APE}$, based on the idea of filtering out non-impactful errors by Automatically Post-Editing (APE) the original translation based on each error, leaving only those errors that contribute to quality improvement. Specifically, we prompt the LLM to act as 1) $\textit{evaluator}$ to provide error annotations, 2) $\textit{post-editor}$ to determine whether errors impact quality improvement, and 3) $\textit{pairwise quality verifier}$ as the error filter. Experiments show that our approach consistently improves both the reliability and quality of error spans against GEMBA-MQM, across eight LLMs in both high- and low-resource languages. Orthogonal to trained approaches, MQM-APE complements translation-specific evaluators such as Tower, highlighting its broad applicability. Further analysis confirms the effectiveness of each module and offers valuable insights into evaluator design and LLM selection.
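
The three roles described in the abstract can be read as a simple filter loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `ErrorAnnotation`, `mqm_ape_filter`, and the three callables are hypothetical, and the toy stand-ins at the bottom replace what would be prompts to the same LLM in the actual framework.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ErrorAnnotation:
    """One predicted error (hypothetical structure, for illustration only)."""
    span: str      # erroneous text in the translation
    category: str  # e.g. "accuracy/mistranslation"
    severity: str  # e.g. "minor", "major", "critical"


# Hypothetical role signatures. In MQM-APE each role is played by the same LLM.
Evaluator = Callable[[str, str], List[ErrorAnnotation]]   # (source, translation) -> errors
PostEditor = Callable[[str, str, ErrorAnnotation], str]   # -> post-edited translation
Verifier = Callable[[str, str, str], bool]                # (source, original, edited) -> edited is better?


def mqm_ape_filter(source: str, translation: str, evaluate: Evaluator,
                   post_edit: PostEditor, is_better: Verifier) -> List[ErrorAnnotation]:
    """Keep only the errors whose post-edit is judged to improve the translation."""
    kept = []
    for error in evaluate(source, translation):
        edited = post_edit(source, translation, error)
        if is_better(source, translation, edited):
            kept.append(error)
    return kept


# --- Toy stand-ins (assumptions, not real LLM calls) ---
def toy_evaluate(source: str, translation: str) -> List[ErrorAnnotation]:
    return [
        ErrorAnnotation("cat", "accuracy/mistranslation", "major"),
        ErrorAnnotation("sleeps", "style/awkward", "minor"),
    ]


def toy_post_edit(source: str, translation: str, error: ErrorAnnotation) -> str:
    fixes = {"cat": "dog"}  # the toy editor only knows how to fix one span
    return translation.replace(error.span, fixes.get(error.span, error.span))


def toy_is_better(source: str, original: str, edited: str) -> bool:
    expected = ("dog", "sleeps")  # toy notion of quality: count of expected words
    def score(text: str) -> int:
        return sum(word in text for word in expected)
    return score(edited) > score(original)


kept = mqm_ape_filter("Der Hund schläft.", "The cat sleeps well.",
                      toy_evaluate, toy_post_edit, toy_is_better)
print([e.span for e in kept])  # ['cat']
```

In this toy run the "sleeps" annotation is discarded because its post-edit leaves the translation unchanged, so the verifier sees no improvement; only the "cat" annotation survives the filter.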

Paper Structure

This paper contains 67 sections, 6 equations, 7 figures, 18 tables.

Figures (7)

  • Figure 1: A comparative overview of our MQM-APE approach. The evaluated translation passes through three sequential modules, all operated by the same LLM: 1) the Error Analysis Evaluator, which provides detailed error demonstrations; 2) APE, which post-edits the translation based on each error annotation; and 3) the Pairwise Quality Verifier, which verifies whether quality improves after post-editing.
  • Figure 2: Comparison between MQM-APE, random error filter ("Random") and GEMBA-MQM ("MQM") on segment-level performance.
  • Figure 3: Comparison between MQM-APE with an LLM verifier and with $\textbf{CometKiwi}_{\textbf{22}}^{\textbf{QE}}$ as a replacement on segment-level performance.
  • Figure 4: Average number of errors retained or discarded for each severity level with MQM-APE.
  • Figure 5: Distribution of error categories between GEMBA-MQM and MQM-APE.
  • ...and 2 more figures