Correcting Length Bias in Neural Machine Translation

Kenton Murray, David Chiang

TL;DR

This paper identifies beam-search degradation and translation brevity in neural machine translation as manifestations of label bias in locally normalized models. It argues for a lightweight, globally-normalized correction via a tunable per-word reward (γ) and compares it to length normalization across multiple language pairs, showing it can largely eliminate the beam problem and improve translation quality. A perceptron-like method enables fast, dataset-specific tuning of γ, with optimal values highly dependent on beam size and task. The findings suggest incorporating such global corrections into baselines can significantly improve decoding robustness and translation quality in NMT.

Abstract

We study two problems in neural machine translation (NMT). First, in beam search, whereas a wider beam should in principle help translation, it often hurts NMT. Second, NMT has a tendency to produce translations that are too short. Here, we argue that these problems are closely related and both rooted in label bias. We show that correcting the brevity problem almost eliminates the beam problem; we compare some commonly-used methods for doing this, finding that a simple per-word reward works well; and we introduce a simple and quick way to tune this reward using the perceptron algorithm.
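The per-word reward and its perceptron-style tuning can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the reward rescoring takes the form log-probability plus γ times output length, and that the update moves γ in proportion to the gap between reference length and the length of the current best hypothesis. All names (`reward_score`, `normalized_score`, `best`, `tune_gamma`), the learning rate `eta`, and the toy hypotheses are hypothetical. As a simplification, it reranks a fixed n-best list instead of re-running beam search after each update.

```python
from typing import List, Sequence, Tuple

# A hypothesis is (model log-probability, length in words).
# The values used with this sketch are made up for illustration.
Hypothesis = Tuple[float, int]


def reward_score(log_prob: float, length: int, gamma: float) -> float:
    """Per-word reward: add gamma for every output word (assumed form)."""
    return log_prob + gamma * length


def normalized_score(log_prob: float, length: int) -> float:
    """Length normalization: average log-probability per word."""
    return log_prob / max(length, 1)  # guard against the empty hypothesis


def best(hyps: Sequence[Hypothesis], gamma: float) -> Hypothesis:
    """Return the hypothesis with the highest reward-adjusted score."""
    return max(hyps, key=lambda h: reward_score(h[0], h[1], gamma))


def tune_gamma(
    nbest_lists: Sequence[Sequence[Hypothesis]],
    ref_lengths: Sequence[int],
    eta: float = 0.1,
    epochs: int = 10,
    gamma: float = 0.0,
) -> float:
    """Perceptron-style tuning of gamma on fixed n-best lists.

    If the current best hypothesis is shorter than the reference,
    gamma grows; if it is longer, gamma shrinks.
    """
    for _ in range(epochs):
        for hyps, ref_len in zip(nbest_lists, ref_lengths):
            _, hyp_len = best(hyps, gamma)
            gamma += eta * (ref_len - hyp_len)
    return gamma


# Toy n-best list in which the empty translation has the highest raw score.
toy_hyps: List[Hypothesis] = [(-1.0, 0), (-4.0, 5), (-9.5, 10)]
tuned = tune_gamma([toy_hyps], [5])
```

With γ = 0 the toy example selects the empty hypothesis, mirroring the brevity problem. After tuning, the five-word hypothesis wins. In a real system the decoder would be re-run with the updated γ, since the contents of the beam themselves depend on the reward.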

Paper Structure

This paper contains 22 sections, 8 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Label bias causes this toy word-by-word translation model to translate French un hélicoptère incorrectly to an autogyro.
  • Figure 2: A locally normalized model must determine, at each time step, a "budget" for the total remaining log-probability. In this example sentence, "The British women won Olymp ic gold in p airs row ing," the empty translation has initial position 622 in the beam. Already by the third step of decoding, the correct translation has a lower score than the empty translation. However, using greedy search, a nonempty translation would be returned.
  • Figure 3: Impact of beam size on BLEU score when varying reference sentence lengths (in words) for Russian-English. The x-axis is cumulative moving right; length 20 includes sentences of length 0-20, while length 10 includes 0-10. As reference length increases, the BLEU scores of a baseline system with beam size of 10 remain nearly constant. However, a baseline system with beam 1000 has a high BLEU score for shorter sentences, but a very low score when the entire test set is used. Our tuned reward and normalized models do not suffer from this problem on the entire test set, but take a slight performance hit on the shortest sentences.
  • Figure 4: Histogram of the length ratio between generated sentences and gold references, across methods and beam sizes, for Russian-English. Note that the baseline method skews closer to 0 as the beam size increases, while our other methods remain peaked around 1.0. There are a few outliers to the right that have been cut off, as well as the peaks at 0.0 and 1.0.
  • Figure 5: Effect of word penalty on BLEU and hypothesis length for Russian-English (top) and German-English (bottom) on 1000 unseen dev examples with a beam size of 50. The vertical bars mark the word reward found during tuning.