Segmentation Beyond Defaults: Asymmetrical Byte Pair Encoding for Optimal Machine Translation Performance

Saumitra Yadav, Manish Shrivastava

TL;DR

This work critiques the common use of symmetric BPE in bilingual MT and introduces asymmetry in subword segmentation by varying the number of merge operations (NMO) for the source and target languages, denoted $m_1$ and $m_2$ respectively. Through an extensive, data-size–aware exploration across English–Hindi and six additional language pairs, it shows that asymmetric BPE yields significant gains in low-resource regimes, with the best configurations giving the source side a larger NMO than the target ($m_1 > m_2$). Domain tests (AI and Chemistry) and FLORES evaluations corroborate the trend, though gains diminish as resource levels rise. The findings advocate for tailored, language- and data-dependent tokenization strategies, with potential extensions to multilingual and fairness-aware tokenization approaches.

Abstract

Existing Machine Translation (MT) research often suggests a single, fixed set of hyperparameters for word segmentation models: symmetric Byte Pair Encoding (BPE), which applies the same number of merge operations (NMO) to train tokenizers for both source and target languages. However, we demonstrate that this uniform approach does not guarantee optimal MT performance across different language pairs and data sizes. This work investigates BPE segmentation recipes across various data volumes and language pairs to evaluate MT system performance. We find that asymmetric BPE, where the source and target languages have different NMOs, significantly improves results over the symmetric approach, especially in low-resource settings (50K, 100K, and 500K sentence pairs). Specifically, asymmetric BPE yields statistically significant ($p<0.05$) average gains of 5.32, 4.46, and 0.7 CHRF++ on English-Hindi in low-resource setups (50K, 100K, and 500K sentence pairs, respectively). We validated this trend across six additional language pairs (English paired with Telugu, Shona, Norwegian, Kyrgyz, Hausa, and Inuktitut), observing statistically significant improvement in 10 out of 12 systems compared to symmetric BPE. Our findings indicate that a high NMO for the source (4K to 32K) and a low NMO for the target (0.5K to 2K) provides optimal results, particularly benefiting low-resource MT.
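
To make the symmetric/asymmetric distinction concrete, the following is a minimal sketch of BPE training with separate merge budgets for the two sides. It is not the authors' pipeline: the function names (`learn_bpe`, `apply_merge`, `segment`, `train_asymmetric`), the toy corpora, and the tie-breaking rule are all assumptions made for illustration, and a real system would use an established BPE toolkit on full parallel data.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(corpus, num_merges):
    """Learn at most `num_merges` merge operations from a list of sentences."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter()
    for sentence in corpus:
        for word in sentence.split():
            vocab[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(apply_merge(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment one word by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


def train_asymmetric(src_corpus, tgt_corpus, m1, m2):
    """Train independent source/target merge tables with budgets m1 and m2."""
    return learn_bpe(src_corpus, m1), learn_bpe(tgt_corpus, m2)


# Toy data (hypothetical); m1 > m2 mirrors the pattern reported for low-resource MT.
src_corpus = ["low lower lowest", "new newer newest"]
tgt_corpus = ["naya naye nayi", "kam kamtar"]
src_merges, tgt_merges = train_asymmetric(src_corpus, tgt_corpus, m1=10, m2=2)

print(segment("lowest", src_merges))
print(segment("naya", tgt_merges))
```

Symmetric BPE corresponds to calling `train_asymmetric` with `m1 == m2`. With a small target budget, target words stay closer to character level, while a larger source budget produces longer source subwords.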

Paper Structure

This paper contains 17 sections, 15 figures, 12 tables.

Figures (15)

  • Figure 1: CHRF++ Scores for Symmetrical BPE (32K, 4K) vs Asymmetrical BPE ($m_1 \neq m_2$)
  • Figure 2: Changes in Optimal BPE Configuration from Low- to High-Resource Settings
  • Figure 3: CHRF++ scores for 0.1M sentence pairs for Hindi-to-English MT systems using configurations of the form 16K_x, where $x \in \{\text{500, 1K, 2K, 4K, 8K, 16K, 25K, 32K}\}$.
  • Figure 4: CHRF++ scores for 0.1M sentence pairs for English-to-Hindi MT systems using configurations of the form 16K_x, where $x \in \{\text{500, 1K, 2K, 4K, 8K, 16K, 25K, 32K}\}$.
  • Figure 5: CHRF++ score comparison of Asymmetric BPE with VOLT for English to Hindi
  • ...and 10 more figures