
Analyzing Uncertainty in Neural Machine Translation

Myle Ott, Michael Auli, David Grangier, Marc'Aurelio Ranzato

TL;DR

This paper investigates how uncertainty affects neural machine translation. It distinguishes intrinsic task uncertainty (multiple valid translations for one source sentence) from extrinsic uncertainty caused by noisy training data, and analyzes how beam search and sampling explore the model distribution. It introduces metrics to assess calibration and distribution fit, showing that search is effective but that the model spreads probability mass across too many hypotheses and under-estimates rare words. A key finding is that training examples whose target is a copy of the source significantly distort large-beam outputs, linking extrinsic uncertainty to the degradation seen with large beams. The authors propose practical mitigations such as data filtering and inference constraints. The work also releases additional human reference translations for WMT benchmarks to support multi-reference evaluation.

Abstract

Machine translation is a popular test bed for research in neural sequence-to-sequence models but despite much recent research, there is still a lack of understanding of these models. Practitioners report performance degradation with large beams, the under-estimation of rare words and a lack of diversity in the final translations. Our study relates some of these issues to the inherent uncertainty of the task, due to the existence of multiple valid translations for a single source sentence, and to the extrinsic uncertainty caused by noisy training data. We propose tools and metrics to assess how uncertainty in the data is captured by the model distribution and how it affects search strategies that generate translations. Our results show that search works remarkably well but that models tend to spread too much probability mass over the hypothesis space. Next, we propose tools to assess model calibration and show how to easily fix some shortcomings of current models. As part of this study, we release multiple human reference translations for two popular benchmarks.
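As a rough illustration of what "spreading probability mass over the hypothesis space" means, the sketch below sums the model probability covered by a set of distinct hypotheses, given their sequence-level log-probabilities. The function name `covered_mass` and the toy scores are assumptions made for illustration; they are not taken from the paper or its code.

```python
import math


def covered_mass(hyp_logprobs):
    """Running total of probability mass over distinct hypotheses.

    hyp_logprobs: dict mapping a hypothesis string to its sequence-level
    log-probability under the model (hypothetical inputs for illustration).
    Returns a list of cumulative totals, most probable hypothesis first.
    """
    # Convert log-probabilities to probabilities and sort in descending order.
    probs = sorted((math.exp(lp) for lp in hyp_logprobs.values()), reverse=True)
    totals, running = [], 0.0
    for p in probs:
        running += p
        totals.append(running)
    return totals


# Toy scores (assumed): three distinct hypotheses for one source sentence.
toy = {"a": math.log(0.05), "b": math.log(0.02), "c": math.log(0.01)}
mass = covered_mass(toy)
```

If the running total stays well below 1 even after many distinct hypotheses have been added, the model has distributed its mass widely rather than concentrating it on a few translations.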

Paper Structure

This paper contains 20 sections, 2 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Left: Cumulative sequence probability of hypotheses obtained by beam search and sampling on the WMT'14 En-Fr validation set; Center: same, but showing the average per-token probability as we increase the number of considered hypotheses, where for each source sentence we select the hypothesis with the maximum probability (orange) or sentence-level BLEU (green); Right: same, but showing averaged sentence-level BLEU as we increase the number of hypotheses.
  • Figure 2: Probability quantiles for tokens in the reference, beam search hypotheses ($k=5$), and sampled hypotheses for the WMT'14 En-Fr validation set.
  • Figure 3: Translation quality of models trained on WMT'17 English-German news-commentary data with added synthetic copy noise in the training data (x-axis) tested with various beam sizes on the validation set.
  • Figure 4: Average probability at each position of the output sequence on the WMT'14 En-Fr validation set, comparing the reference translation, beam search hypothesis ($k=5$), and copying the source sentence.
  • Figure 5: BLEU on newstest2017 as a function of beam width for models trained on all of the WMT'17 En-De training data (original), a filtered version of the training data (filtered) and a small but clean subset of the training data (clean). We also show results when excluding copies as a post-processing step (no copy).
  • ...and 7 more figures
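
The captions of Figures 3 and 5 refer to copy noise and to excluding copies. A minimal sketch of one way to flag such copies follows, assuming a copy is a target sentence whose unigram Jaccard similarity with the source meets a threshold. The helper names and the default threshold of 0.5 are assumptions for illustration, not the authors' implementation.

```python
def jaccard(a_tokens, b_tokens):
    """Jaccard similarity between two token collections (as sets)."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 1.0  # two empty sentences are treated as identical
    return len(a & b) / len(a | b)


def is_copy(source, target, threshold=0.5):
    """Flag a target as a copy of the source (threshold is an assumption)."""
    return jaccard(source.split(), target.split()) >= threshold


def filter_copies(pairs, threshold=0.5):
    """Keep only (source, target) pairs whose target is not flagged as a copy."""
    return [(s, t) for s, t in pairs if not is_copy(s, t, threshold)]


# Toy parallel data (assumed): the first pair is a verbatim copy.
pairs = [
    ("the cat sat", "the cat sat"),
    ("the cat sat", "die Katze sitzt"),
]
kept = filter_copies(pairs)
```

The same `is_copy` check could in principle be applied to training pairs (filtering) or to generated hypotheses (post-processing); whitespace tokenization is used here only to keep the sketch self-contained.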