Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models
Minh-Thang Luong, Christopher D. Manning
TL;DR
Neural machine translation systems typically rely on fixed vocabularies, which makes open-vocabulary translation difficult. The authors introduce a hybrid word-character architecture that keeps a fast word-level backbone, uses a source-side character module to represent rare words, and uses a character-level generator to produce unknown target words, all trained end-to-end. On WMT'15 English–Czech, this approach gains +2.1 to +11.4 BLEU over models that already handle unknown words, reaches a state-of-the-art 20.7 BLEU with ensembles, and scores well on chrF3, while removing the need for a separate unk-replacement step. The results show that the character components generate well-formed Czech words and build effective representations for English source words. Purely character-based models can also be competitive, and future work aims to improve efficiency.
Abstract
Nearly all previous work on neural machine translation (NMT) has used quite restricted vocabularies, perhaps with a subsequent method to patch in unknown words. This paper presents a novel word-character solution to achieving open vocabulary NMT. We build hybrid systems that translate mostly at the word level and consult the character components for rare words. Our character-level recurrent neural networks compute source word representations and recover unknown target words when needed. The advantage of such a hybrid approach is twofold: it is much faster and easier to train than character-based models, and, unlike word-based models, it never produces unknown words. On the WMT'15 English to Czech translation task, this hybrid approach offers an additional boost of +2.1 to +11.4 BLEU points over models that already handle unknown words. Our best system achieves a new state-of-the-art result of 20.7 BLEU. We demonstrate that our character models can successfully learn not only to generate well-formed words for Czech, a highly inflected language with a very complex vocabulary, but also to build correct representations for English source words.
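The division of labor described above can be illustrated with a minimal Python sketch of the control flow only. It contains no neural network, and all names here (`route_source`, `fill_unknowns`, `generate_chars`, `UNK`) are hypothetical and not taken from the paper. The sketch assumes a fixed word vocabulary: source words inside it would use a word-level lookup, words outside it would be represented from their characters, and each `<unk>` emitted on the target side would be replaced by a word produced by a character-level generator.

```python
from typing import Callable, List, Set, Tuple

UNK = "<unk>"  # hypothetical placeholder symbol for out-of-vocabulary words


def route_source(tokens: List[str], vocab: Set[str]) -> List[Tuple[str, str]]:
    """Tag each source token with the component that would represent it."""
    routed = []
    for tok in tokens:
        if tok in vocab:
            # Frequent word: a word-level embedding lookup would be used.
            routed.append(("word", tok))
        else:
            # Rare word: a representation would be composed from its characters.
            routed.append(("char", tok))
    return routed


def fill_unknowns(
    word_output: List[str], generate_chars: Callable[[int], str]
) -> List[str]:
    """Replace each <unk> in the word-level output with a generated word.

    `generate_chars` is a stand-in for a character-level generator. Here it
    receives only the target position (an assumption made for brevity); a
    real generator would be conditioned on the decoder state at that position.
    """
    return [
        generate_chars(i) if tok == UNK else tok
        for i, tok in enumerate(word_output)
    ]


if __name__ == "__main__":
    vocab = {"the", "cat", "is"}
    print(route_source(["the", "cat", "is", "cute"], vocab))
    # Toy generator that returns a label instead of a real word.
    print(fill_unknowns(["the", UNK, "is", UNK], lambda i: f"gen{i}"))
```

Under these assumptions the word-level path handles most tokens, and the character-level path is consulted only for out-of-vocabulary positions, which is consistent with the abstract's claim that the hybrid is faster to train than a purely character-based model.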
