Integrating Multi-scale Contextualized Information for Byte-based Neural Machine Translation

Langlin Huang, Yang Feng

TL;DR

This work addresses the limits of fixed subword vocabularies in multilingual and out-of-domain NMT by adopting byte-based tokenization and introducing Multi-Scale Contextualization (MSC). MSC partitions the hidden state dimensions into $n$ groups and applies to each group a contextualizing function $g_i(\cdot,k)$ whose kernel size $k$ sets its scope; the output then feeds the Multi-Head Attention module, which dynamically integrates the multi-scale information. The approach yields consistent gains over other byte-based methods in multilingual and zero-shot cross-domain tasks, although subword models can still excel in English-centric settings; the gains are most pronounced when contextualization scales are balanced and tuned to language scripts. Overall, MSC improves the adaptability and efficiency of byte-based NMT, with practical benefits for low-resource languages and cross-domain translation, and the code is publicly available for replication.

Abstract

Subword tokenization is a common method for vocabulary building in Neural Machine Translation (NMT) models. However, increasingly complex tasks have revealed its disadvantages. First, a vocabulary cannot be modified once it is learned, making it hard to adapt to new words. Second, in multilingual translation, the imbalance in data volumes across different languages spreads to the vocabulary, which degrades translations involving low-resource languages. While byte-based tokenization addresses these issues, byte-based models struggle with the low information density inherent in UTF-8 byte sequences. Previous works enhance token semantics through local contextualization but fail to select an appropriate contextualizing scope based on the input. Consequently, we propose the Multi-Scale Contextualization (MSC) method, which learns contextualized information of varying scales across different hidden state dimensions. It then leverages the attention module to dynamically integrate the multi-scale contextualized information. Experiments show that MSC significantly outperforms subword-based and other byte-based methods in both multilingual and out-of-domain scenarios. Code can be found at https://github.com/ictnlp/Multiscale-Contextualization.
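
To make the "low information density" point concrete, the short Python sketch below turns a few words into UTF-8 byte tokens and reports how many bytes each character costs. The helper names (`utf8_byte_tokens`, `bytes_per_char`) and the sample words are illustrative assumptions, not part of the paper or its released code.

```python
def utf8_byte_tokens(text: str) -> list[int]:
    """Return the UTF-8 byte values of `text`; a byte-based model sees one token per byte."""
    return list(text.encode("utf-8"))


def bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes needed per character of `text`."""
    return len(text.encode("utf-8")) / len(text)


# Hypothetical sample words in different scripts (chosen only for illustration).
samples = {"English": "cat", "German": "Käse", "Chinese": "猫"}

for language, word in samples.items():
    tokens = utf8_byte_tokens(word)
    print(f"{language}: {word!r} -> {len(tokens)} byte tokens, "
          f"{bytes_per_char(word):.2f} bytes per character")
```

A single character can span one to four byte tokens depending on the script, so an individual byte token carries little meaning on its own, and the span that does carry meaning differs across languages. This is the motivation the abstract gives for choosing the contextualizing scope based on the input.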

Paper Structure

This paper contains 14 sections, 1 equation, 1 figure, and 6 tables.

Figures (1)

  • Figure 1: Multi-Scale Contextualization module: the input vector $x$, with hidden state dimension $dim_{model}$ and text length $l$, is divided into $n$ parts according to the hidden state dimensions, and then $n$ contextualizing functions with different scopes process these parts respectively. The output $\hat{x}$ now contains multi-scale information and acts as input to the Multi-Head Attention module.
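
The data flow described in the Figure 1 caption can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a fixed local average stands in for whatever learned contextualizing function the paper uses, a kernel size of 0 is treated as a pass-through, windows are centred and truncated at the sequence edges, and the names `local_average` and `multi_scale_contextualize` are hypothetical.

```python
def local_average(seq, k):
    """Average each position with its neighbours in a centred window of size k.

    Assumption for this sketch: k == 0 means no contextualization (pass-through),
    and windows are truncated at the sequence boundaries instead of padded.
    """
    if k == 0:
        return [row[:] for row in seq]
    half = k // 2
    out = []
    for t in range(len(seq)):
        window = seq[max(0, t - half): t + half + 1]
        dim = len(seq[t])
        out.append([sum(row[d] for row in window) / len(window) for d in range(dim)])
    return out


def multi_scale_contextualize(x, kernel_sizes):
    """Split the hidden dimensions of x into n groups and contextualize each at its own scale.

    x: list of l positions, each a list of dim_model floats.
    kernel_sizes: one kernel size per group (n = len(kernel_sizes)).
    Returns x_hat with the same shape as x.
    """
    n = len(kernel_sizes)
    dim_model = len(x[0])
    if dim_model % n != 0:
        raise ValueError("dim_model must be divisible by the number of groups")
    width = dim_model // n
    parts = []
    for i, k in enumerate(kernel_sizes):
        # Slice out the i-th group of hidden dimensions for every position.
        part = [row[i * width:(i + 1) * width] for row in x]
        parts.append(local_average(part, k))
    # Concatenate the groups back along the hidden dimension, position by position.
    return [sum((p[t] for p in parts), []) for t in range(len(x))]


# Toy input: l = 3 positions, dim_model = 2, so n = 2 groups of width 1.
x = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]]
x_hat = multi_scale_contextualize(x, kernel_sizes=[0, 3])
print(x_hat)  # first column unchanged, second column smoothed over 3 positions
```

Because each group of dimensions holds information at a different scale, the attention layer that consumes $\hat{x}$ can weight those dimensions differently per input, which is how the caption's "multi-scale information" reaches the Multi-Head Attention module.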