Mutual Information and Diverse Decoding Improve Neural Machine Translation
Jiwei Li, Dan Jurafsky
TL;DR
This paper tackles the unidirectional limitation of standard neural MT by introducing a maximum mutual information objective that incorporates the reverse translation probability p(x|y). It implements this idea through a practical reranking framework using two separately trained models (p(y|x) and p(x|y)) and enhances it with a diversity-promoting decoding scheme to produce a richer N-best list for reranking. Empirical results on WMT English–German and English–French show consistent BLEU gains across both LSTM and attention-based architectures, with additional improvements from language-model reranking and unknown word replacement. The work offers a simple, modular approach to leverage bidirectional information and diversity in neural MT, with potential applicability to other neural generation tasks and directions for integrating MI directly into first-pass decoding.
Abstract
Sequence-to-sequence neural translation models learn semantic and syntactic relations between sentence pairs by optimizing the likelihood of the target given the source, i.e., $p(y|x)$, an objective that ignores other potentially useful sources of information. We introduce an alternative objective function for neural MT that maximizes the mutual information between the source and target sentences, modeling the bi-directional dependency of sources and targets. We implement the model with a simple re-ranking method, and also introduce a decoding algorithm that increases diversity in the N-best list produced by the first pass. Applied to the WMT German/English and French/English tasks, the proposed models offer a consistent performance boost on both standard LSTM and attention-based neural MT architectures.
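To make the re-ranking idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the first pass has already produced an N-best list, and that two scoring functions are available: one returning $\log p(y|x)$ from the forward model and one returning $\log p(x|y)$ from the reverse model. The function names, the interpolation weight `lam`, and the simple weighted-sum form of the score are illustrative assumptions; the summary above does not specify the exact combination or any additional features such as length or language-model terms.

```python
from typing import Callable, List, Tuple

# Hypothetical scorer type: takes (source, target) and returns a log-probability.
Scorer = Callable[[str, str], float]


def mmi_rerank(
    source: str,
    candidates: List[str],
    log_p_y_given_x: Scorer,
    log_p_x_given_y: Scorer,
    lam: float = 0.5,
) -> List[Tuple[float, str]]:
    """Re-rank an N-best list using forward and reverse model scores.

    Assumed scoring rule (illustrative only):
        score(y) = log p(y|x) + lam * log p(x|y)
    Returns (score, candidate) pairs sorted from best to worst.
    """
    scored = []
    for y in candidates:
        forward = log_p_y_given_x(source, y)   # how likely y is given x
        backward = log_p_x_given_y(source, y)  # how well y lets us recover x
        scored.append((forward + lam * backward, y))
    # Higher combined log-score is better.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

With `lam = 0` this reduces to ordinary ranking by $p(y|x)$; larger values give more weight to candidates from which the source can be recovered. In practice such a weight would be tuned on held-out data.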
