Overcoming the Curse of Sentence Length for Neural Machine Translation using Automatic Segmentation
Jean Pouget-Abadie, Dzmitry Bahdanau, Bart van Merrienboer, Kyunghyun Cho, Yoshua Bengio
TL;DR
This paper tackles the deterioration of neural machine translation quality on long sentences by automatically segmenting the input into shorter clauses that the model can translate well. A bidirectional, confidence-based score rates each candidate segment, and dynamic programming selects the segmentation with the best overall score. Empirical results show BLEU improvements on long sentences and robustness to unknown words, though concatenating independently translated segments can hurt fluency and punctuation. The work offers a practical route to extending neural MT to longer inputs, while underscoring the need for better reordering and fluency handling after segmentation.
Abstract
Cho et al. (2014a) have shown that the recently introduced neural machine translation systems suffer a significant drop in translation quality on long sentences, unlike existing phrase-based translation systems. In this paper, we propose to address this issue by automatically segmenting an input sentence into phrases that the neural machine translation model can translate easily. Once each segment has been translated independently by the model, the translated clauses are concatenated to form the final translation. Empirical results show a significant improvement in translation quality for long sentences.
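The segment, translate, and concatenate pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `best_segmentation` and `translate_long`, the `max_len` cap on segment length, and the additive objective (summing per-segment confidence) are assumptions made for illustration, and the `confidence` and `translate` callables are placeholders for the model-based scorer and the neural translation model.

```python
from typing import Callable, List, Sequence, Tuple


def best_segmentation(
    words: Sequence[str],
    confidence: Callable[[Sequence[str]], float],
    max_len: int,
) -> List[Tuple[int, int]]:
    """Return (start, end) spans covering `words` that maximise summed confidence.

    `confidence` is a hypothetical scorer; higher means "easier to translate".
    """
    n = len(words)
    best = [float("-inf")] * (n + 1)  # best[i]: best total score for words[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last segment
    best[0] = 0.0
    for end in range(1, n + 1):
        # Only consider segments of at most `max_len` words ending at `end`.
        for start in range(max(0, end - max_len), end):
            total = best[start] + confidence(words[start:end])
            if total > best[end]:
                best[end] = total
                back[end] = start
    # Walk the back-pointers from the end of the sentence to recover the spans.
    spans: List[Tuple[int, int]] = []
    end = n
    while end > 0:
        spans.append((back[end], end))
        end = back[end]
    spans.reverse()
    return spans


def translate_long(
    words: Sequence[str],
    confidence: Callable[[Sequence[str]], float],
    translate: Callable[[Sequence[str]], str],
    max_len: int,
) -> str:
    """Translate each selected segment independently and concatenate the results."""
    spans = best_segmentation(words, confidence, max_len)
    return " ".join(translate(words[s:e]) for s, e in spans)
```

Because each segment is translated in isolation and the outputs are simply joined in source order, a sketch like this also makes the stated limitation visible: nothing reorders material across segment boundaries or repairs punctuation at the joins.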
