Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation

Bryan Eikema, Wilker Aziz

TL;DR

This work argues that maximum a posteriori (MAP) decoding is an inadequate decision rule for neural machine translation (NMT), and that many observed pathologies stem from the decoding rule rather than from the model or its training objective. By drawing samples from the model’s translation distribution and comparing them with beam search outputs and gold-standard references using hierarchical Bayesian models, the authors show that the distribution spreads over a broad set of plausible translations, with the mode often carrying very little probability mass. They show that beam search outputs stray from data statistics such as length and n-gram agreement with the training data, whereas samples reproduce them well. Sampling-based approaches, particularly an approximation to minimum Bayes risk (MBR) decoding, yield translations that are competitive with beam search and sometimes match or surpass it, especially in low-resource or domain-shift settings. The paper advocates treating NMT models as probability distributions and developing decision rules that use the whole distribution rather than fixating on the mode. The practical takeaway is that future decoding strategies should move toward holistic, sampling-based methods to improve translation quality and robustness.
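The claim that the mode carries little probability mass can be illustrated with a small sketch. It assumes each sample arrives as a `(translation, log_prob)` pair scored by the model; the function name `cumulative_mass` and the toy numbers are hypothetical and do not come from the paper.

```python
import math

def cumulative_mass(samples):
    """Running total probability of the unique translations seen so far.

    `samples` is a list of (translation, log_prob) pairs, assumed to be
    ancestral samples scored by the model they were drawn from.
    """
    seen = set()
    total = 0.0
    curve = []
    for text, log_prob in samples:
        if text not in seen:           # count each distinct translation once
            seen.add(text)
            total += math.exp(log_prob)
        curve.append(total)
    return curve

# Toy data (made up): four samples, one of them a repeat.
samples = [
    ("a b", math.log(0.02)),
    ("a c", math.log(0.01)),
    ("a b", math.log(0.02)),
    ("d", math.log(0.005)),
]
curve = cumulative_mass(samples)
```

If the final value of `curve` stays far below 1 even after many samples, no single translation, including the mode, can account for much of the distribution.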

Abstract

Recent studies have revealed a number of pathologies of neural machine translation (NMT) systems. Hypotheses explaining these mostly suggest there is something fundamentally wrong with NMT as a model or its training algorithm, maximum likelihood estimation (MLE). Most of this evidence was gathered using maximum a posteriori (MAP) decoding, a decision rule aimed at identifying the highest-scoring translation, i.e. the mode. We argue that the evidence corroborates the inadequacy of MAP decoding more than casts doubt on the model and its training algorithm. In this work, we show that translation distributions do reproduce various statistics of the data well, but that beam search strays from such statistics. We show that some of the known pathologies and biases of NMT are due to MAP decoding and not to NMT's statistical assumptions nor MLE. In particular, we show that the most likely translations under the model accumulate so little probability mass that the mode can be considered essentially arbitrary. We therefore advocate for the use of decision rules that take into account the translation distribution holistically. We show that an approximation to minimum Bayes risk decoding gives competitive results confirming that NMT models do capture important aspects of translation well in expectation.
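As a rough illustration of a sampling-based approximation to minimum Bayes risk decoding, the sketch below picks the sample with the highest average utility against all other samples. It is a minimal sketch under stated assumptions, not the authors' implementation: the utility here is a simple unigram F1 chosen for self-containment, and the names `unigram_f1` and `mbr_decode` are hypothetical.

```python
from collections import Counter

def unigram_f1(hyp, ref):
    """Unigram F1 between two whitespace-tokenised strings (stand-in utility)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())    # multiset intersection of tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def mbr_decode(samples, utility=unigram_f1):
    """Return the candidate with the highest Monte Carlo expected utility.

    The samples serve both as the candidate set and as pseudo-references,
    so repeated samples weigh more in the expectation.
    """
    best, best_score = None, float("-inf")
    for cand in dict.fromkeys(samples):    # unique candidates, order kept
        score = sum(utility(cand, s) for s in samples) / len(samples)
        if score > best_score:
            best, best_score = cand, score
    return best

# Toy samples (made up) standing in for draws from an NMT model.
samples = ["the cat sat", "the cat sat down", "a cat sat", "the dog ran"]
choice = mbr_decode(samples)
```

Unlike MAP decoding, this rule never asks which translation is most probable; it asks which one agrees best, on average, with everything the model considers plausible. The cost is quadratic in the number of samples because every candidate is scored against every sample.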

Paper Structure

This paper contains 19 sections, 5 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: A comparison using hierarchical Bayesian models of statistics extracted from beam search outputs, samples from the model and gold-standard references. We show the posterior density on the y-axis, and the mean Poisson rate (length) and agreement with training data (unigrams, bigrams, skip-bigrams) on the x-axis for each group and language pair.
  • Figure 2: Cumulative probability of the unique translations in 1,000 ancestral samples on the held-out (top), and newstest2018 / Flores (bottom) test sets. The dark blue line shows the average cumulative probability over all test sentences, the shaded area represents 1 standard deviation away from the average. The black dots to the right show the final cumulative probability for each individual test sentence.
  • Figure 3: METEOR scores for oracle-selected samples as a function of sample size on the held-out data (top) and newstest2018 / Flores (bottom) test sets. For each sample size we repeat the experiment $4$ times and show a box plot per sample size. Dashed blue lines show beam search scores.
  • Figure :