
Remedy-R: Generative Reasoning for Machine Translation Evaluation without Error Annotations

Shaomu Tan, Ryosuke Mitani, Ritvik Choudhary, Qiyu Wu, Toshiyuki Sekiya, Christof Monz

TL;DR

Remedy-R addresses the explainability and robustness gaps in automatic MT evaluation by introducing a reasoning-driven metric trained with reinforcement learning from pairwise human preferences. It outputs step-by-step analyses across accuracy, fluency, and completeness, followed by a final 0–100 score, enabling interpretable judgments. With about 60k training pairs across two language directions, Remedy-R matches top scalar metrics and GPT-4-based judges on WMT meta-evaluation and generalizes to additional languages and challenging OOD inputs. The accompanying Remedy-R Agent demonstrates practical utility by using evaluation rationales to refine translations across diverse models, indicating that Remedy-R’s reasoning captures translation-relevant signals useful for real-world translation improvement.

Abstract

Over the years, automatic MT metrics have hill-climbed benchmarks and achieved strong, sometimes human-level, agreement with human ratings. Yet they remain black boxes, offering little insight into their decision-making and often failing on real-world out-of-distribution (OOD) inputs. We introduce Remedy-R, a reasoning-driven generative MT metric trained with reinforcement learning from pairwise translation preferences, without requiring error-span annotations or distillation from closed LLMs. Remedy-R produces step-by-step analyses of accuracy, fluency, and completeness, followed by a final score, enabling more interpretable assessments. With only 60K training pairs across two language pairs, Remedy-R remains competitive with top scalar metrics and GPT-4-based judges on WMT22-24 meta-evaluation, generalizes to other languages, and exhibits strong robustness on OOD stress tests. Moreover, Remedy-R models generate self-reflective feedback that can be reused for translation improvement. Building on this finding, we introduce Remedy-R Agent, a simple evaluate-revise pipeline that uses Remedy-R's evaluation analysis to refine translations. This agent consistently improves translation quality across diverse models, including Qwen2.5, ALMA-R, GPT-4o-mini, and Gemini-2.0-Flash, suggesting that Remedy-R's reasoning captures translation-relevant information and is practically useful.
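The evaluate-revise idea behind Remedy-R Agent can be illustrated with a minimal sketch. The abstract does not specify the agent's prompts, stopping rule, or acceptance criterion, so everything below is an assumption made for illustration: the `evaluate` and `revise` callables are hypothetical stand-ins for a Remedy-R evaluation call and a translation-model revision call, and the "keep the revision only if its score improves" rule is one plausible choice, not the paper's stated procedure.

```python
from typing import Callable, Tuple

# Hypothetical evaluator output: (rationale text, final score on a 0-100 scale).
Evaluation = Tuple[str, float]


def evaluate_revise(
    source: str,
    translation: str,
    evaluate: Callable[[str, str], Evaluation],
    revise: Callable[[str, str, str], str],
    max_rounds: int = 1,
) -> Tuple[str, float]:
    """Refine a translation using the evaluator's rationale as feedback.

    Assumption: a revision is accepted only when it scores strictly higher
    than the current best translation; otherwise the loop stops early.
    """
    best = translation
    rationale, best_score = evaluate(source, best)
    for _ in range(max_rounds):
        # The reviser sees the source, the current translation, and the rationale.
        candidate = revise(source, best, rationale)
        cand_rationale, cand_score = evaluate(source, candidate)
        if cand_score <= best_score:
            break  # no improvement, keep the current best
        best, best_score, rationale = candidate, cand_score, cand_rationale
    return best, best_score
```

In practice, `evaluate` would wrap a call that generates the step-by-step analysis and parses the final score, and `revise` would prompt a translation model with that analysis. Both wrappers depend on the specific models and are left out here.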

Paper Structure

This paper contains 44 sections, 9 equations, 2 figures, 8 tables.

Figures (2)

  • Figure 1: Average correlation across WMT23 MQM benchmarks under different numbers of Test-Time Scaling (TTS) evaluation passes. Each configuration aggregates multiple independent evaluations by averaging their final quality scores. TTS consistently improves correlation as the number of evaluation passes increases. Full results are shown in Table \ref{tab:wmt23} in the Appendix.
  • Figure 2: Refinement performance comparison on the initial translations from GPT-4o-mini and Gemini-2.0-Flash using the paragraph-level WMT24++ benchmark. We use reference-based XCOMET-XXL to measure translation quality, and provide results for additional metrics in the Appendix.
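
The score aggregation described in the Figure 1 caption (several independent evaluation passes, averaged into one final quality score) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `evaluate_once` is a hypothetical callable standing in for one sampled Remedy-R evaluation that returns a final score, and its `(source, translation)` signature is an assumption.

```python
from statistics import mean
from typing import Callable


def tts_score(
    evaluate_once: Callable[[str, str], float],
    source: str,
    translation: str,
    num_passes: int = 4,
) -> float:
    """Average the final scores of `num_passes` independent evaluation passes."""
    if num_passes < 1:
        raise ValueError("num_passes must be at least 1")
    # Each call is assumed to be an independent (e.g. sampled) evaluation.
    scores = [evaluate_once(source, translation) for _ in range(num_passes)]
    return mean(scores)
```

With `num_passes=1` this reduces to a single evaluation. Larger values trade extra inference cost for a lower-variance score.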