QUEST: Quality-Aware Metropolis-Hastings Sampling for Machine Translation

Gonçalo R. A. Faria; Sweta Agrawal; António Farinhas; Ricardo Rei; José G. C. de Souza; André F. T. Martins

TL;DR

This paper offers a simple and effective way to avoid over-reliance on noisy quality estimates: it uses them as the energy function of a Gibbs distribution and draws multiple samples from high-density regions with the Metropolis-Hastings algorithm, a simple Markov chain Monte Carlo approach.

Abstract

An important challenge in machine translation (MT) is to generate high-quality and diverse translations. Prior work has shown that the estimated likelihood from the MT model correlates poorly with translation quality. In contrast, quality evaluation metrics (such as COMET or BLEURT) exhibit high correlations with human judgments, which has motivated their use as rerankers (such as quality-aware and minimum Bayes risk decoding). However, relying on a single translation with high estimated quality increases the chances of "gaming the metric". In this paper, we address the problem of sampling a set of high-quality and diverse translations. We provide a simple and effective way to avoid over-reliance on noisy quality estimates by using them as the energy function of a Gibbs distribution. Instead of looking for a mode in the distribution, we generate multiple samples from high-density areas through the Metropolis-Hastings algorithm, a simple Markov chain Monte Carlo approach. The results show that our proposed method leads to high-quality and diverse outputs across multiple language pairs (English$\leftrightarrow${German, Russian}) with two strong decoder-only LLMs (Alma-7b, Tower-7b).
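
To make the abstract's two ingredients concrete, the block below writes out a generic Gibbs distribution whose (negative) energy is a quality score, together with the standard Metropolis-Hastings acceptance probability. The notation is an assumption for illustration only: $r(x, y)$ for the quality estimate of translation $y$ given source $x$, $\beta$ for a temperature, and $q$ for a proposal distribution. The exact target used in the paper may contain additional terms not stated in the abstract.

```latex
% Gibbs distribution with the quality estimate as (negative) energy;
% r, \beta and q are assumed notation, not taken from the abstract.
\pi_\beta(y \mid x) \;\propto\; \exp\!\left(\frac{r(x, y)}{\beta}\right)

% Standard Metropolis-Hastings acceptance probability for a move y -> y'
% proposed from q(y' \mid y):
\alpha(y \to y') \;=\; \min\!\left\{ 1,\;
  \frac{\pi_\beta(y' \mid x)\, q(y \mid y')}{\pi_\beta(y \mid x)\, q(y' \mid y)}
\right\}
```

Because only the ratio of target densities appears in $\alpha$, the normalizing constant of the Gibbs distribution never has to be computed, which is what makes sampling from it feasible.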

Paper Structure (36 sections, 18 equations, 10 figures, 1 algorithm)

Introduction
Background
Large Language Models for Machine Translation
Automatic Metrics for Machine Translation
An MCMC-based Decoding Approach for Text Generation
Metropolis-Hastings
Proposal distribution
Connections to Reinforcement Learning with Human Feedback
Experimental Settings
Data and Evaluation
Models
Automatic Metrics for Quest
Decoding Configurations
Compute Comparison: Ancestral Vs Quest
Results
...and 21 more sections

Figures (10)

Figure 1: Quest samples an index from the current translation ($y^t$), removes all elements to the right of the index, generates a new continuation, and then uses the Metropolis-Hastings acceptance criterion to decide whether to accept or reject the resulting new translation. The process continues for a fixed number of $T$ iterations.
Figure 2: Average quality vs. diversity on WMT23 datasets. Different points represent different hyperparameter values. Quest outperforms ancestral sampling in six out of eight settings.
Figure 3: Average quality measured by xComet-XL (left) and CometKiwi-XL on the English-Russian dataset using Tower-7b.
Figure 4: Average quality (CometKiwi-XL) vs. diversity (Pairwise-BLEU) on WMT23 datasets. Different points represent different hyperparameter values.
Figure 5: Average quality (xComet-XL) vs. diversity (Pairwise-BLEU) on additional LPs from WMT23 and WMT22. Different points represent different hyperparameter values.
...and 5 more figures
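
The loop described in the Figure 1 caption (pick an index, drop everything to its right, regenerate a continuation, then accept or reject) can be sketched on a toy problem. This is a minimal illustration under loudly stated assumptions, not the authors' implementation: sequences have a fixed length, the "language model" is a uniform sampler over a four-token vocabulary, the quality metric is a hypothetical match count against a fixed string, and the target is assumed to be proportional to the model probability times $\exp(r/\beta)$. Under these assumptions the model terms cancel and the acceptance ratio reduces to $\exp((r(y') - r(y))/\beta)$. With variable-length outputs a further correction for the index choice would be needed.

```python
import math
import random

VOCAB = ["a", "b", "c", "d"]   # toy vocabulary (hypothetical)
TARGET = list("abcdab")        # toy "reference", used only by the toy reward
LENGTH = len(TARGET)           # fixed sequence length (simplifying assumption)


def toy_reward(y):
    """Toy stand-in for a quality metric: number of positions matching TARGET."""
    return sum(1 for tok, ref in zip(y, TARGET) if tok == ref)


def toy_continue(prefix, rng):
    """Toy stand-in for LM sampling: fill the remaining positions uniformly."""
    return prefix + [rng.choice(VOCAB) for _ in range(LENGTH - len(prefix))]


def mh_sample(steps, beta, seed=0):
    """Run a Metropolis-Hastings chain with suffix-regeneration proposals."""
    rng = random.Random(seed)
    y = toy_continue([], rng)              # initial sample from the toy model
    chain = []
    for _ in range(steps):
        i = rng.randrange(LENGTH)          # sample an index in the current sequence
        y_new = toy_continue(y[:i], rng)   # keep the prefix, regenerate the rest
        # Log acceptance ratio; the toy-model terms cancel under the
        # assumptions stated in the lead-in.
        log_alpha = (toy_reward(y_new) - toy_reward(y)) / beta
        if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
            y = y_new                      # accept the proposal
        chain.append(list(y))              # on rejection the old state is repeated
    return chain


if __name__ == "__main__":
    chain = mh_sample(steps=4000, beta=0.5, seed=0)
    kept = chain[1000:]                    # discard an initial burn-in segment
    mean_reward = sum(toy_reward(y) for y in kept) / len(kept)
    print("mean toy reward after burn-in:", round(mean_reward, 2))
    print("distinct sequences visited:", len({"".join(y) for y in kept}))
```

Smaller values of `beta` concentrate the chain on high-reward sequences, while larger values accept more downhill moves and yield more varied samples, which mirrors the quality versus diversity trade-off plotted in the figures.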
