Don't Rank, Combine! Combining Machine Translation Hypotheses Using Quality Estimation
Giorgos Vernikos, Andrei Popescu-Belis
TL;DR
The paper addresses the mismatch between MT model probabilities and human judgments by introducing QE-fusion, a span-level method that synthesizes a translation by merging fragments from a diverse candidate pool, guided by a quality estimation (QE) metric. The method is evaluated with multiple open-source LLMs and multilingual MT models on five language pairs, where it consistently outperforms beam search, MBR decoding, and QE-reranking; the gains are largest for LLMs, whose outputs are more diverse. QE-fusion produces novel translations in over half of the cases, reduces hallucinations, keeps its advantage as the pool grows from 5 to 200 candidates, and has a runtime that scales linearly with pool size. The approach may generalize to other text generation tasks in which a reward or quality metric can guide span-level merging.
Abstract
Neural machine translation systems estimate probabilities of target sentences given source sentences, yet these estimates may not align with human preferences. This work introduces QE-fusion, a method that synthesizes translations using a quality estimation (QE) metric, which correlates better with human judgments than model probabilities do. QE-fusion leverages a pool of candidates sampled from a model and combines spans from different candidates using a QE metric such as CometKiwi. We compare QE-fusion against beam search and recent reranking techniques, such as Minimum Bayes Risk (MBR) decoding and QE-reranking. Our method consistently improves translation quality in terms of COMET and BLEURT scores when applied to large language models (LLMs) used for translation (PolyLM, XGLM, Llama2, Mistral, ALMA, and Tower) and to multilingual translation models (NLLB), over five language pairs. Notably, QE-fusion yields larger improvements for LLMs, owing to their ability to generate diverse outputs. We demonstrate that our approach generates novel translations in over half of the cases and consistently outperforms other methods across pool sizes ranging from 5 to 200 candidates. Furthermore, we show empirically that the runtime of QE-fusion scales linearly with the number of candidates in the pool.
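To make the contrast between reranking and combining concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a greedy search over divergent spans found with `difflib`, and it replaces a real QE metric such as CometKiwi with a hypothetical toy scorer (`toy_qe`) so that the example runs without downloading a model. All function names, the lexicon, and the example sentences are illustrative assumptions.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

QEScorer = Callable[[str, str], float]  # (source, hypothesis) -> quality score


def toy_qe(source: str, hypothesis: str) -> float:
    """Hypothetical stand-in for a QE metric: counts source words whose
    expected translation (from a tiny hand-made lexicon) appears in the hypothesis."""
    lexicon = {"chat": "cat", "dort": "sleeps", "tapis": "rug"}
    hyp = set(hypothesis.lower().split())
    return float(sum(1 for w in source.lower().split()
                     if w in lexicon and lexicon[w] in hyp))


def qe_rerank(source: str, candidates: List[str], qe: QEScorer) -> str:
    """QE-reranking: pick the single best candidate; nothing new is created."""
    return max(candidates, key=lambda c: qe(source, c))


def divergent_spans(base: List[str], other: List[str]) -> List[Tuple[int, int, List[str]]]:
    """Spans (start, end, replacement) where `other` differs from `base`."""
    matcher = SequenceMatcher(a=base, b=other, autojunk=False)
    return [(i1, i2, other[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]


def fuse_greedy(source: str, candidates: List[str], qe: QEScorer,
                max_rounds: int = 100) -> str:
    """Greedy span-level fusion (an assumed simplification): start from the
    top-ranked candidate and accept any span swap that raises the QE score."""
    base = qe_rerank(source, candidates, qe).split()
    best = qe(source, " ".join(base))
    for _ in range(max_rounds):
        improved = False
        for cand in candidates:
            for i1, i2, repl in divergent_spans(base, cand.split()):
                trial = base[:i1] + repl + base[i2:]
                score = qe(source, " ".join(trial))
                if score > best:
                    base, best, improved = trial, score, True
                    break  # span indices are stale after an edit; recompute
            if improved:
                break
        if not improved:
            break
    return " ".join(base)


source = "le chat dort sur le tapis"
pool = [
    "the cat naps on the rug",
    "the dog sleeps on the rug",
    "the cat sleeps on the carpet",
]
reranked = qe_rerank(source, pool, toy_qe)
fused = fuse_greedy(source, pool, toy_qe)
print(reranked)
print(fused)
```

With this toy scorer, every candidate in the pool scores equally, so reranking can only return one of the three flawed hypotheses, whereas the fusion sketch assembles "the cat sleeps on the rug", a translation that appears nowhere in the pool. A real setup would pass a learned QE model as the `qe` argument, and would likely explore several span combinations at once rather than accepting the first improvement.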
