Can QE-informed (Re)Translation lead to Error Correction?

Govardhan Padmanabhan

TL;DR

Can QE-informed retranslation reduce MT errors without over-editing? The paper compares two training-free QE-informed strategies for segment-level error correction on the WMT 2025 task: (i) a primary method that uses QE to select among multiple LLM translations, and (ii) a secondary method that masks the QE-annotated error spans and has an LLM fill them in. On the test set, the primary approach achieves a $\Delta$COMET of 0.0201, while the secondary yields -0.0108, indicating that QE-guided selection can outperform direct post-editing in this setup. The results underline how much model selection and prompting strategy matter for QE-guided edits, and they suggest future work that combines complementary models and includes human evaluation.

Abstract

The paper presents two approaches submitted to the WMT 2025 Automated Translation Quality Evaluation Systems Task 3: Quality Estimation (QE)-informed Segment-level Error Correction. While jointly training QE systems with Automatic Post-Editing (APE) has shown improved performance for both tasks, APE systems are still known to overcorrect the output of Machine Translation (MT), leading to a degradation in performance. We investigate a simple training-free approach, QE-informed Retranslation, and compare it with another approach within the same training-free paradigm. Our winning approach selects the highest-quality translation from multiple candidates generated by different LLMs. The second approach, more akin to APE, instructs an LLM to replace error substrings as specified in the provided QE explanation(s). A conditional heuristic was employed to minimise the number of edits, with the aim of maximising the Gain-to-Edit ratio. The two proposed approaches achieved Delta COMET scores of 0.0201 and -0.0108, respectively, placing the first approach at the top of the subtask leaderboard.
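The selection idea behind the first approach can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `select_translation`, the `qe_score` callable, the `min_gain` threshold, and the toy scores are all hypothetical, and the abstract does not specify the conditional heuristic, so the threshold rule here is only one plausible way to avoid unnecessary edits.

```python
from typing import Callable, Sequence


def select_translation(
    source: str,
    original_mt: str,
    candidates: Sequence[str],
    qe_score: Callable[[str, str], float],
    min_gain: float = 0.0,
) -> str:
    """Pick the candidate with the highest QE score.

    The original MT output is kept unless the best candidate beats it
    by more than `min_gain` (an assumed heuristic to limit edits).
    """
    if not candidates:
        return original_mt

    baseline = qe_score(source, original_mt)
    best = max(candidates, key=lambda cand: qe_score(source, cand))

    if qe_score(source, best) - baseline > min_gain:
        return best
    return original_mt


# Toy QE scorer with made-up scores, standing in for a real QE model.
TOY_SCORES = {
    "Das ist Test.": 0.55,
    "Das ist ein Test.": 0.80,
    "Dies ist ein Test.": 0.86,
}


def toy_qe(source: str, hypothesis: str) -> float:
    return TOY_SCORES[hypothesis]


src = "This is a test."
mt = "Das ist Test."
llm_outputs = ["Das ist ein Test.", "Dies ist ein Test."]

chosen = select_translation(src, mt, llm_outputs, toy_qe)
kept = select_translation(src, mt, llm_outputs, toy_qe, min_gain=0.5)
```

With the toy scores, `chosen` is the highest-scoring candidate, while `kept` falls back to the original MT because no candidate improves on it by more than 0.5. In practice `qe_score` would wrap a reference-free QE metric, and the candidates would come from different LLMs.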

Paper Structure

This paper contains 18 sections, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: Domain distribution within each language
  • Figure 2: System performance in hypothesis_segment
  • Figure 3: Language-wise count of model responses used in primary approach's final output