MT-Ranker: Reference-free machine translation evaluation by inter-system ranking

Ibraheem Muhammad Moosa, Rui Zhang, Wenpeng Yin

TL;DR

This work reframes MT evaluation as a reference-free, pairwise ranking task and introduces MT-Ranker, a multilingual T5-based model that predicts which translation in a pair is better. Training follows a three-stage pipeline: indirect supervision from cross-lingual NLI, discrimination between human and machine translations, and weakly supervised synthetic data. This pipeline yields state-of-the-art correlations with human judgments without human-annotated supervision. Across the DA20, MQM, and ACES benchmarks, MT-Ranker achieves top performance relative to reference-free baselines and some reference-based ones, showing its practical utility when references are unavailable. The study also analyzes ablations, zero-shot generalization to unseen language pairs, and untranslated phenomena, highlighting both the method’s robustness and its current limitations on certain error categories.

Abstract

Traditionally, Machine Translation (MT) evaluation has been treated as a regression problem: producing an absolute translation-quality score. This approach has two limitations: i) the scores lack interpretability, and human annotators struggle to give consistent scores; ii) most scoring methods are based on (reference, translation) pairs, limiting their applicability in real-world scenarios where references are absent. In practice, we often care about whether a new MT system is better or worse than its competitors. In addition, reference-free MT evaluation is increasingly practical and necessary. Unfortunately, these two practical considerations have yet to be jointly explored. In this work, we formulate reference-free MT evaluation as a pairwise ranking problem. Given the source sentence and a pair of translations, our system predicts which translation is better. In addition to proposing this new formulation, we further show that this new paradigm achieves superior correlation with human judgments using only indirect supervision from natural language inference and weak supervision from our synthetic data. In the context of reference-free evaluation, MT-Ranker, trained without any human annotations, achieves state-of-the-art results on the WMT Shared Metrics Task benchmarks DARR20, MQM20, and MQM21. On a more challenging benchmark, ACES, which contains fine-grained evaluation criteria such as addition, omission, and mistranslation errors, MT-Ranker achieves state-of-the-art results against reference-free as well as reference-based baselines.
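To make the pairwise formulation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical comparator interface that takes a source sentence and two translations and returns 0 or 1, and it uses a toy length-based heuristic as a stand-in for a trained model. Agreement with human preferences is measured with a Kendall's tau-like statistic, (concordant - discordant) / (concordant + discordant), a common choice for relative-ranking data. This section does not state whether the paper uses exactly this form.

```python
from typing import Callable, List, Tuple

# Hypothetical interface (an assumption for illustration):
# returns 0 if the first translation is judged better, 1 otherwise.
Comparator = Callable[[str, str, str], int]


def toy_comparator(source: str, t0: str, t1: str) -> int:
    """Stand-in for a trained ranker: prefers the translation whose
    word count is closer to the source's. Purely illustrative."""
    n = len(source.split())
    d0 = abs(len(t0.split()) - n)
    d1 = abs(len(t1.split()) - n)
    return 0 if d0 <= d1 else 1


def kendall_tau_like(
    examples: List[Tuple[str, str, str]], compare: Comparator
) -> float:
    """examples holds (source, human_preferred, human_dispreferred) tuples.
    A decision is concordant when the comparator also prefers the
    human-preferred translation, which is passed in the first slot."""
    concordant = discordant = 0
    for source, better, worse in examples:
        if compare(source, better, worse) == 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


examples = [
    ("a b c", "x y z", "x"),
    ("a b", "x y", "x y z w"),
    ("a b c d", "x", "x y z w"),
]
tau = kendall_tau_like(examples, toy_comparator)
```

Because the metric only needs a binary decision per pair, no absolute quality score is ever produced or calibrated. In a real evaluation the order of the pair would typically be shuffled so the comparator cannot exploit position.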

Paper Structure (46 sections, 12 equations, 4 figures, 10 tables)

Figures (4)

  • Figure 1: Our system receives a pair of translations and makes a binary decision on which translation has better quality. In contrast, traditional reference-free evaluation systems generate a quality score for a single translation. The main difference between our approach and previous approaches is highlighted in red.
  • Figure 2: Illustration of the input format and the architecture of MT-Ranker. The source sentence and the translation pairs are formatted as a single text input to the system. The bidirectional LLM attends over all the tokens of the source and the translations simultaneously.
  • Figure 3: Impact of removing training stages on the performance of MT-Ranker-Large on the DA20 dataset.
  • Figure 4: Performance improves on all benchmarks after supervised training.
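The single-text input described in the Figure 2 caption can be sketched as follows. The template wording, the function names `format_pair` and `pick_better`, and the `classify` callable are all assumptions made for illustration; the actual prompt format used by MT-Ranker is not given in this section.

```python
from typing import Callable


def format_pair(source: str, translation_0: str, translation_1: str) -> str:
    # Hypothetical template: the source and both candidates are joined
    # into one string so the encoder can attend over all of them at once.
    return (
        f"Source: {source} "
        f"Translation 0: {translation_0} "
        f"Translation 1: {translation_1}"
    )


def pick_better(
    source: str,
    translation_0: str,
    translation_1: str,
    classify: Callable[[str], int],
) -> str:
    # classify is a placeholder for a trained binary classifier that
    # returns the index (0 or 1) of the translation it judges better.
    label = classify(format_pair(source, translation_0, translation_1))
    return translation_0 if label == 0 else translation_1
```

Packing the source and both translations into one sequence is what lets a bidirectional encoder compare the candidates token by token, rather than scoring each one in isolation.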