Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models

Minh-Thang Luong, Christopher D. Manning

TL;DR

Open vocabulary neural machine translation remains challenging due to fixed vocabularies. The authors introduce a hybrid word-character architecture that preserves a fast word-level backbone while employing a source-character module for rare words and a character-level generator for unknown target forms, trained end-to-end. On WMT'15 English–Czech, this approach yields significant gains, achieving a state-of-the-art 20.7 BLEU with ensembles and strong chrF3 scores, while removing the need for separate unk replacement steps. The results demonstrate robust translation of Czech morphology and effective representations for English, and they show that purely character models can also be competitive, with future work aimed at improving efficiency.

Abstract

Nearly all previous work on neural machine translation (NMT) has used quite restricted vocabularies, perhaps with a subsequent method to patch in unknown words. This paper presents a novel word-character solution to achieving open vocabulary NMT. We build hybrid systems that translate mostly at the word level and consult the character components for rare words. Our character-level recurrent neural networks compute source word representations and recover unknown target words when needed. The twofold advantage of such a hybrid approach is that it is much faster and easier to train than character-based ones; at the same time, it never produces unknown words as in the case of word-based models. On the WMT'15 English to Czech translation task, this hybrid approach offers an additional boost of +2.1 to +11.4 BLEU points over models that already handle unknown words. Our best system achieves a new state-of-the-art result with a BLEU score of 20.7. We demonstrate that our character models can successfully learn to not only generate well-formed words for Czech, a highly inflected language with a very complex vocabulary, but also build correct representations for English source words.
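
The source-side half of the hybrid idea described above can be illustrated with a small sketch: frequent words are looked up in a word embedding table, while rare words get a representation composed from their characters. This is a toy illustration, not the authors' implementation. The vocabulary, the dimension `DIM`, the random weights, and the function names are all assumptions made for the example, and a plain tanh recurrence stands in for the character-level recurrent networks the abstract mentions.

```python
import math
import random

DIM = 4                  # toy embedding size (assumption, chosen for readability)
rng = random.Random(0)   # fixed seed so the sketch is deterministic


def rand_vec(n):
    """Return a small random vector; stands in for trained parameters."""
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]


# Word-level table for a tiny "frequent" vocabulary (hypothetical words).
WORD_EMB = {w: rand_vec(DIM) for w in ["a", "cat", "the"]}

# Character embeddings and one recurrent weight matrix for the toy char RNN.
CHARS = "abcdefghijklmnopqrstuvwxyz_"
CHAR_EMB = {c: rand_vec(DIM) for c in CHARS}
ZERO = [0.0] * DIM       # fallback for characters outside CHARS
W_H = [rand_vec(DIM) for _ in range(DIM)]


def char_rnn(word):
    """Run a plain tanh RNN over the characters and return the final state."""
    h = [0.0] * DIM
    for c in word + "_":  # "_" marks the word boundary, as in Figure 1
        x = CHAR_EMB.get(c, ZERO)
        h = [
            math.tanh(x[i] + sum(W_H[i][j] * h[j] for j in range(DIM)))
            for i in range(DIM)
        ]
    return h


def source_representation(word):
    """Frequent words: table lookup. Rare words: compose from characters."""
    if word in WORD_EMB:
        return WORD_EMB[word]
    return char_rnn(word)


if __name__ == "__main__":
    for w in ["a", "cute", "cat"]:
        kind = "word lookup" if w in WORD_EMB else "built from characters"
        print(w, kind, source_representation(w))
```

In this sketch "a" and "cat" come straight from the table, while "cute" is built character by character, so every input word receives a vector of the same size and the word-level encoder never has to see an unknown-word placeholder on the source side. The target side, where a character-level generator recovers unknown words, is not covered here.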

Paper Structure

This paper contains 18 sections, 6 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Hybrid NMT -- example of a word-character model for translating "a cute cat" into "un joli chat". Hybrid NMT translates at the word level. For rare tokens, the character-level components build source representations and recover target <unk>. "_" marks sequence boundaries.
  • Figure 2: Attention mechanism.
  • Figure 3: Vocabulary size effect -- shown are the performances of different systems as we vary their vocabulary sizes. We highlight the improvements obtained by our hybrid models over word-based systems which already handle unknown words.
  • Figure 4: Barnes-Hut-SNE visualization of source word representations -- shown are sample words from the Rare Word dataset. We differentiate two types of embeddings: frequent words in which encoder embeddings are looked up directly and rare words where we build representations from characters. Boxes highlight examples that we will discuss in the text. We use the hybrid model (l) in this visualization.