Fully Character-Level Neural Machine Translation without Explicit Segmentation

Jason Lee, Kyunghyun Cho, Thomas Hofmann

TL;DR

This work removes explicit word segmentation from neural machine translation by introducing a fully character-level encoder-decoder model (char2char) whose encoder uses convolution and max-pooling to shorten the source sequence while preserving local patterns. The model matches or outperforms subword-level baselines across four language pairs and shows larger gains in multilingual many-to-one translation, where a single character-level encoder shares capacity across languages and handles code-switching. The gains are reported in both BLEU and human evaluation, and the approach trains at a speed comparable to subword-level models while keeping an open vocabulary. Overall, the study argues that character-level translation is a viable direction for multilingual MT systems and future many-to-many extensions.

Abstract

Most existing machine translation systems operate at the level of words, relying on explicit segmentation to extract tokens. We introduce a neural machine translation (NMT) model that maps a source character sequence to a target character sequence without any segmentation. We employ a character-level convolutional network with max-pooling at the encoder to reduce the length of source representation, allowing the model to be trained at a speed comparable to subword-level models while capturing local regularities. Our character-to-character model outperforms a recently proposed baseline with a subword-level encoder on WMT'15 DE-EN and CS-EN, and gives comparable performance on FI-EN and RU-EN. We then demonstrate that it is possible to share a single character-level encoder across multiple languages by training a model on a many-to-one translation task. In this multilingual setting, the character-level encoder significantly outperforms the subword-level encoder on all the language pairs. We observe that on CS-EN, FI-EN and RU-EN, the quality of the multilingual character-level translation even surpasses the models specifically trained on that language pair alone, both in terms of BLEU score and human judgment.
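The length reduction described in the abstract can be illustrated with a minimal sketch of non-overlapping max-pooling along the character axis. This is not the authors' implementation: the function name `max_pool_over_time` and the toy feature values are assumptions made for illustration, the convolution that would produce the per-character features is omitted, and the stride of 5 is taken from the caption of Figure 1.

```python
def max_pool_over_time(features, stride):
    """Non-overlapping max-pooling along the time (character) axis.

    features: list of T feature vectors (lists of floats of equal width),
              standing in for the outputs of a character-level convolution.
    stride:   number of consecutive character positions merged into one segment.

    Returns ceil(T / stride) pooled vectors. A final, shorter segment is pooled
    over the positions it has, which gives the same result as padding it with
    negative infinity.
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    pooled = []
    for start in range(0, len(features), stride):
        segment = features[start:start + stride]
        # Take the per-dimension maximum across the positions in this segment.
        pooled.append([max(column) for column in zip(*segment)])
    return pooled


# Hypothetical toy input: 12 character positions, 2 features per position.
features = [[float(t), float(-t)] for t in range(12)]
pooled = max_pool_over_time(features, stride=5)
print(len(features), "->", len(pooled))  # sequence length before and after pooling
```

With a stride of 5, the 12 character positions collapse into 3 segment vectors, so any recurrent layer placed on top of the pooled sequence has roughly one fifth as many steps to process.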

Paper Structure

This paper contains 24 sections, 5 equations, 2 figures, 13 tables.

Figures (2)

  • Figure 1: Encoder architecture schematics. Underscore denotes padding. A dotted vertical line delimits each segment. The stride of pooling $s$ is 5 in the diagram.
  • Figure 2: Multilingual models overfit less than bilingual models on low-resource language pairs.