Don't Rank, Combine! Combining Machine Translation Hypotheses Using Quality Estimation

Giorgos Vernikos, Andrei Popescu-Belis

TL;DR

The paper addresses the mismatch between the probabilities assigned by MT models and human judgments of quality. It introduces QE-fusion, a method that synthesizes a translation at the span level by merging fragments from a pool of diverse candidates under the guidance of a quality estimation (QE) metric. The method is tested with several open-source LLMs and multilingual MT models on five language pairs, and it consistently outperforms beam search, MBR decoding, and QE-reranking, with larger gains for LLMs, whose outputs are more diverse. Key findings are that QE-fusion produces novel translations in over half of the cases, reduces hallucinations, keeps its advantage as the pool grows from 5 to 200 candidates, and has a runtime that scales linearly with pool size. The approach may extend to other text generation tasks in which a reward or quality metric can guide span-level merging.

Abstract

Neural machine translation systems estimate probabilities of target sentences given source sentences, yet these estimates may not align with human preferences. This work introduces QE-fusion, a method that synthesizes translations using a quality estimation metric (QE), which correlates better with human judgments. QE-fusion leverages a pool of candidates sampled from a model, combining spans from different candidates using a QE metric such as CometKiwi. We compare QE-fusion against beam search and recent reranking techniques, such as Minimum Bayes Risk decoding or QE-reranking. Our method consistently improves translation quality in terms of COMET and BLEURT scores when applied to large language models (LLMs) used for translation (PolyLM, XGLM, Llama2, Mistral, ALMA, and Tower) and to multilingual translation models (NLLB), over five language pairs. Notably, QE-fusion exhibits larger improvements for LLMs due to their ability to generate diverse outputs. We demonstrate that our approach generates novel translations in over half of the cases and consistently outperforms other methods across varying numbers of candidates (5-200). Furthermore, we empirically establish that QE-fusion scales linearly with the number of candidates in the pool.
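The two reranking baselines named in the abstract can be illustrated with a minimal sketch. Both only select an existing candidate from the pool, which is the limitation QE-fusion is meant to lift. The function names and the `jaccard` utility are hypothetical stand-ins chosen for illustration; in practice the QE scorer would be a learned metric such as CometKiwi and the MBR utility a metric such as COMET, neither of which is called here.

```python
from typing import Callable, List


def jaccard(hyp: str, ref: str) -> float:
    """Toy utility (assumption): token-set overlap between two strings."""
    a, b = set(hyp.split()), set(ref.split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def qe_rerank(source: str, candidates: List[str],
              qe: Callable[[str, str], float]) -> str:
    """QE-reranking: return the candidate with the highest reference-free score."""
    return max(candidates, key=lambda hyp: qe(source, hyp))


def mbr_decode(candidates: List[str],
               utility: Callable[[str, str], float]) -> str:
    """MBR decoding: return the candidate with the highest average utility
    when the other candidates are used as pseudo-references."""
    def expected_utility(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        total = sum(utility(candidates[i], ref) for ref in others)
        return total / max(len(others), 1)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


if __name__ == "__main__":
    pool = ["the cat sat on the mat", "a cat sat on the mat", "the dog ran"]
    # MBR favors the candidate that agrees most with the rest of the pool.
    print(mbr_decode(pool, jaccard))
```

Note that MBR compares every candidate with every other one, so its cost grows quadratically with the pool size, whereas QE-reranking scores each candidate once.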

Paper Structure

This paper contains 33 sections, 1 equation, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of the QE-fusion pipeline. The method first generates multiple hypotheses by sampling translations from the model. Then, it computes and sorts the spans that diverge among the candidates. Finally, a QE metric is used to select a span from each group and these spans are merged to form a new, refined translation.
  • Figure 2: BLEURT scores of QE-fusion and other methods over pools of candidates of increasing sizes from the XGLM-2.9B LLM. QE-fusion outperforms reranking approaches and is comparable to the COMET-reranking oracle for pools of up to 25 candidates.
  • Figure 3: Frequencies at which outputs produced by QE-fusion appear in larger candidate pools sampled from XGLM-2.9B. Results show that QE-fusion always synthesizes a substantial number of novel candidates that the LLM would not generate otherwise.
  • Figure 4: Effect of temperature on translation performance (above) and on the diversity of the pool (below), using an LLM and an NMT model for en$\rightarrow$de translation, with QE-fusion vs. QE-reranking.
  • Figure 5: Runtimes (in seconds) for different pool sizes for the en$\rightarrow$de WMT22 test set.
  • ...and 2 more figures
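The pipeline described in the Figure 1 caption (find the spans that diverge among candidates, pick one span per group with a QE metric, merge) can be sketched as follows. This is a simplified greedy variant under stated assumptions, not the paper's algorithm: it aligns word tokens with Python's `difflib`, keeps a single hypothesis, and walks the spans from right to left so that token indices stay valid. `toy_qe`, `GOOD_WORDS`, `divergent_spans`, and `qe_fusion` are hypothetical names, and `toy_qe` is only a stand-in for a learned metric such as CometKiwi.

```python
import difflib
from typing import Callable, Dict, List, Tuple

GOOD_WORDS = {"the", "cat", "sat", "on", "mat"}  # toy lexicon (assumption)


def toy_qe(source: str, hyp: str) -> float:
    """Toy QE score (assumption): fraction of hypothesis tokens in GOOD_WORDS.
    The source is ignored here; a real QE metric would use it."""
    tokens = hyp.split()
    return sum(t in GOOD_WORDS for t in tokens) / len(tokens) if tokens else 0.0


def divergent_spans(base: List[str], others: List[List[str]]
                    ) -> Dict[Tuple[int, int], List[List[str]]]:
    """Map each (start, end) token range of `base` to the alternative
    token spans that other candidates propose for that range."""
    spans: Dict[Tuple[int, int], List[List[str]]] = {}
    for other in others:
        matcher = difflib.SequenceMatcher(a=base, b=other, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                continue
            alternatives = spans.setdefault((i1, i2), [])
            if other[j1:j2] not in alternatives:
                alternatives.append(other[j1:j2])
    return spans


def qe_fusion(source: str, candidates: List[str],
              qe: Callable[[str, str], float]) -> str:
    """Greedy span-level fusion: start from the top-scored candidate and
    accept an alternative span only if it raises the QE score."""
    ranked = sorted(candidates, key=lambda h: qe(source, h), reverse=True)
    base = ranked[0].split()
    spans = divergent_spans(base, [c.split() for c in ranked[1:]])

    current = list(base)
    best_score = qe(source, " ".join(current))
    boundary = len(base)  # leftmost base index already rewritten
    for i1, i2 in sorted(spans, reverse=True):  # right to left
        if i2 > boundary:
            continue  # overlaps a region that was already replaced
        best_alt = None
        for alt in spans[(i1, i2)]:
            trial = current[:i1] + alt + current[i2:]
            score = qe(source, " ".join(trial))
            if score > best_score:
                best_score, best_alt = score, alt
        if best_alt is not None:
            current[i1:i2] = best_alt
            boundary = i1
    return " ".join(current)


if __name__ == "__main__":
    pool = ["the cat sat on a rug",
            "the kitty sat on the mat",
            "the cat is on a mat"]
    print(qe_fusion("(source sentence)", pool, toy_qe))
```

In this toy pool, the fused output combines spans from different candidates and is not itself a member of the pool, which is the sense in which QE-fusion can produce novel translations. Each span alternative triggers one QE call on a full hypothesis, so a practical implementation would batch these calls.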