ReMedy: Learning Machine Translation Evaluation from Human Preferences with Reward Modeling

Shaomu Tan, Christof Monz

TL;DR

This work tackles the challenge of noisy human ratings in MT evaluation by reframing quality assessment as reward modeling from pairwise human preferences. ReMedy trains a pretrained language model to assign rewards that reflect translation quality using a Bradley-Terry-based ranking objective, augmented with reward regularization and entropy-guided calibration to produce discriminative scores. Across the WMT22-24 benchmarks, ReMedy-9B achieves state-of-the-art performance at both the segment and system level, outperforming larger models and ensemble methods while remaining parameter-efficient. The method is also robust on the ACES and MSLC challenge sets and yields gains when integrated into MT-RLHF pipelines, which suggests that preference-based evaluation is practically useful for improving MT systems.

Abstract

A key challenge in MT evaluation is the inherent noise and inconsistency of human ratings. Regression-based neural metrics struggle with this noise, while prompting LLMs shows promise at system-level evaluation but performs poorly at segment level. In this work, we propose ReMedy, a novel MT metric framework that reformulates translation evaluation as a reward modeling task. Instead of regressing on imperfect human ratings directly, ReMedy learns relative translation quality using pairwise preference data, resulting in a more reliable evaluation. In extensive experiments across WMT22-24 shared tasks (39 language pairs, 111 MT systems), ReMedy achieves state-of-the-art performance at both segment- and system-level evaluation. Specifically, ReMedy-9B surpasses larger WMT winners and massive closed LLMs such as MetricX-13B, XCOMET-Ensemble, GEMBA-GPT-4, PaLM-540B, and finetuned PaLM2. Further analyses demonstrate that ReMedy delivers superior capability in detecting translation errors and evaluating low-quality translations.
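
To make the pairwise idea concrete, the following is a minimal sketch of a plain Bradley-Terry preference loss, the family of objective the TL;DR refers to. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, rewards are plain floats rather than model outputs, and the reward regularization and entropy-guided calibration mentioned above are deliberately left out because their exact form is not given here.

```python
import math


def bradley_terry_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the preferred translation wins.

    Under a Bradley-Terry model, P(preferred beats rejected) is
    sigmoid(reward_preferred - reward_rejected), so the loss is
    -log(sigmoid(margin)) = log(1 + exp(-margin)).
    """
    margin = reward_preferred - reward_rejected
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Modeled probability that translation A is preferred over translation B."""
    return math.exp(-bradley_terry_loss(reward_a, reward_b))


if __name__ == "__main__":
    # Hypothetical rewards for a better and a worse translation of the same source.
    print(bradley_terry_loss(1.5, -0.5))      # small loss: ranking already correct
    print(bradley_terry_loss(-0.5, 1.5))      # large loss: ranking inverted
    print(preference_probability(1.5, -0.5))  # probability above 0.5
```

The loss depends only on the difference between the two rewards, which is why this kind of training needs relative judgments (which translation is better) rather than trustworthy absolute ratings.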

Paper Structure

This paper contains 59 sections, 7 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: We report the average of system- and segment-level pairwise accuracy on the WMT22 MQM set. The results show that our largest ReMedy model achieves SOTA performance, surpassing previous WMT winners like MetricX-XXL, COMET, and massive fine-tuned closed LLMs like PaLM2.
  • Figure 2: Kernel density plots of quality scores at various model checkpoints. Percentages indicate training progress stages, with dashed lines marking mean scores.
  • Figure 3: ReMedy data format for training and inference
  • Figure 4: Reward calibration with high temperature
  • Figure 5: Reward calibration with low temperature
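
The pairwise accuracy reported in Figure 1 can be illustrated with a simplified sketch: the fraction of item pairs that a metric orders the same way as the human scores. This version is an assumption for illustration only. The function name is hypothetical, pairs tied in the human scores are skipped, and a metric tie counts as a disagreement. Official WMT evaluations may handle ties differently.

```python
from itertools import combinations
from typing import Sequence


def pairwise_accuracy(human_scores: Sequence[float],
                      metric_scores: Sequence[float]) -> float:
    """Share of pairs where the metric agrees with the human ranking.

    Pairs with equal human scores are ignored; a metric tie on a pair
    that humans did rank counts as a disagreement.
    """
    if len(human_scores) != len(metric_scores):
        raise ValueError("score lists must have the same length")

    agree = 0
    total = 0
    for i, j in combinations(range(len(human_scores)), 2):
        human_diff = human_scores[i] - human_scores[j]
        if human_diff == 0:
            continue  # no human preference for this pair
        total += 1
        metric_diff = metric_scores[i] - metric_scores[j]
        if human_diff * metric_diff > 0:
            agree += 1

    if total == 0:
        raise ValueError("no pairs with a human preference")
    return agree / total


if __name__ == "__main__":
    # Hypothetical scores for three systems (or three segments).
    print(pairwise_accuracy([1.0, 2.0, 3.0], [0.1, 0.3, 0.2]))
```

The same function applies at either granularity: pass per-system averages for system-level accuracy, or per-translation scores of one source segment for segment-level accuracy.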