Character-based Neural Machine Translation
Marta R. Costa-Jussà, José A. R. Fonollosa
TL;DR
This work addresses the vocabulary and morphological challenges of neural machine translation by replacing standard lookup-based word embeddings on the source side with character-based word embeddings built by a convolutional neural network followed by highway layers. The character-based representations are integrated into an attention-based encoder–decoder framework, giving an unlimited source vocabulary and better handling of affixes. Empirical results on the German–English WMT task show BLEU gains of up to about 3 points, driven by fewer unknown source words and improved handling of morphology, with additional gains when unknown words (UNKs) are postprocessed. The approach offers a practical path to more scalable, morphologically aware NMT, and could be extended to target-side representations in future work.
Abstract
Neural Machine Translation (MT) has reached state-of-the-art results. However, one of the main challenges that neural MT still faces is dealing with very large vocabularies and morphologically rich languages. In this paper, we propose a neural MT system that uses character-based embeddings in combination with convolutional and highway layers to replace the standard lookup-based word representations. The resulting unlimited-vocabulary and affix-aware source word embeddings are tested in a state-of-the-art neural MT system based on an attention-based bidirectional recurrent neural network. The proposed MT scheme provides improved results even when the source language is not morphologically rich. Improvements of up to 3 BLEU points are obtained in the German-English WMT task.
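To make the idea of a character-based source word embedding more concrete, here is a minimal sketch in plain Python of the general pattern the abstract describes: character embeddings, convolution filters over character windows, and a highway layer. It is an illustration under stated assumptions, not the authors' implementation. The class name `CharWordEncoder`, the filter widths, the dimensions, the tanh nonlinearities, the max-over-time pooling step, and the random (untrained) weights are all hypothetical choices made for the example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, x, b):
    # Dense affine map W x + b on plain Python lists.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class CharWordEncoder:
    """Hypothetical character-to-word encoder: char embeddings -> CNN -> highway."""

    def __init__(self, alphabet, char_dim=4, widths=(1, 2, 3),
                 filters_per_width=3, seed=0):
        rng = random.Random(seed)
        self.char_dim = char_dim
        # One small vector per character; weights are random here, learned in practice.
        self.char_emb = {c: [rng.uniform(-0.1, 0.1) for _ in range(char_dim)]
                         for c in alphabet}
        self.unk_char = [0.0] * char_dim  # assumption: unseen characters map to zeros
        # One filter bank per width; each filter covers `width` consecutive characters.
        self.filters = {w: rand_matrix(rng, filters_per_width, w * char_dim)
                        for w in widths}
        self.out_dim = filters_per_width * len(widths)
        d = self.out_dim
        self.W_h, self.b_h = rand_matrix(rng, d, d), [0.0] * d
        # Negative gate bias (an assumption) so the layer initially favours carrying x.
        self.W_t, self.b_t = rand_matrix(rng, d, d), [-2.0] * d

    def encode(self, word):
        chars = [self.char_emb.get(c, self.unk_char) for c in word]
        # Zero-pad short words so the widest filter has at least one window.
        while len(chars) < max(self.filters):
            chars.append([0.0] * self.char_dim)
        pooled = []
        for width, bank in self.filters.items():
            # Each window is the concatenation of `width` character vectors.
            windows = [sum(chars[i:i + width], [])
                       for i in range(len(chars) - width + 1)]
            for filt in bank:
                # Max-over-time pooling: keep the strongest response per filter.
                pooled.append(max(
                    math.tanh(sum(a * b for a, b in zip(filt, win)))
                    for win in windows))
        # Highway layer: y = t * g(W_h x + b_h) + (1 - t) * x.
        h = [math.tanh(v) for v in matvec(self.W_h, pooled, self.b_h)]
        t = [sigmoid(v) for v in matvec(self.W_t, pooled, self.b_t)]
        return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, pooled)]


encoder = CharWordEncoder("abcdefghijklmnopqrstuvwxyzäöüß")
vector = encoder.encode("unübersetzbar")  # works for any string, seen or unseen
print(len(vector))
```

Because the word vector is computed from characters rather than looked up in a fixed table, any source string yields an embedding, which is the sense in which the source vocabulary is unlimited. Words that share prefixes or suffixes also share character windows, so the filters can respond to affixes. In a full system, a vector like this would take the place of the word lookup embedding fed to the bidirectional recurrent encoder.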
