Assessing the Importance of Frequency versus Compositionality for Subword-based Tokenization in NMT

Benoist Wolleb, Romain Silvestri, Giorgos Vernikos, Ljiljana Dolamic, Andrei Popescu-Belis

TL;DR

This paper investigates whether word frequency or subword compositionality primarily drives the effectiveness of subword tokenization in neural machine translation. It introduces a Huffman-coding-based tokenization that encodes words according to their frequency using a fixed symbol vocabulary, thereby decoupling frequency effects from compositionality and enabling a direct comparison with BPE. Across three language pairs, frequency alone accounts for roughly 90–95% of the BLEU scores reached by BPE, and Huffman tokenization closely tracks BPE as the symbol budget grows; the residual gap is attributed to compositionality and unknown-word handling. The findings challenge the claimed centrality of subword compositionality and suggest that frequency-driven encoding captures most of BPE's effectiveness, which can inform future tokenization design and the exploration of other compression-inspired methods.

Abstract

Subword tokenization is the de facto standard for tokenization in neural language models and machine translation systems. Three advantages are frequently cited in favor of subwords: shorter encoding of frequent tokens, compositionality of subwords, and ability to deal with unknown words. As their relative importance is not yet entirely clear, we propose a tokenization approach that enables us to separate frequency (the first advantage) from compositionality. The approach uses Huffman coding to tokenize words, by order of frequency, using a fixed number of symbols. Experiments with CS-DE, EN-FR and EN-DE NMT show that frequency alone accounts for 90%-95% of the scores reached by BPE, hence compositionality has less importance than previously thought.
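
To make the idea of frequency-only tokenization concrete, here is a minimal sketch of textbook n-ary Huffman coding over word frequencies. It is not the paper's own algorithm, whose details may differ; the function names `huffman_codes` and `encode`, the regex word splitter, and the tie-breaking order are assumptions made for illustration. The sample sentence is the one quoted in the Figure 1 caption.

```python
import heapq
import re
from collections import Counter
from itertools import count


def huffman_codes(freqs, n_symbols):
    """Map each word to a tuple of symbol ids using n-ary Huffman coding."""
    if n_symbols < 2:
        raise ValueError("need at least 2 symbols")
    tie = count()  # unique tie-breaker so heap entries never compare nodes
    # A node is a word (leaf), a list of child nodes, or None (padding leaf).
    heap = [(f, next(tie), w) for w, f in sorted(freqs.items())]
    # Pad with zero-weight leaves so every merge takes exactly n_symbols nodes.
    while len(heap) > 1 and (len(heap) - 1) % (n_symbols - 1) != 0:
        heap.append((0, next(tie), None))
    heapq.heapify(heap)
    # Repeatedly merge the n_symbols lightest nodes into one internal node.
    while len(heap) > 1:
        children = [heapq.heappop(heap) for _ in range(n_symbols)]
        weight = sum(c[0] for c in children)
        heapq.heappush(heap, (weight, next(tie), [c[2] for c in children]))

    codes = {}

    def walk(node, prefix):
        if node is None:  # padding leaf, carries no word
            return
        if isinstance(node, list):
            for symbol, child in enumerate(node):
                walk(child, prefix + (symbol,))
        else:
            codes[node] = prefix or (0,)  # single-word vocabulary edge case

    walk(heap[0][2], ())
    return codes


def encode(words, codes):
    """Replace each word by its symbol sequence (assumes no unknown words)."""
    return [symbol for word in words for symbol in codes[word]]


text = "the house is on the hill, the house is blue, the sky is blue."
words = re.findall(r"[a-z]+", text.lower())  # assumed, simplistic word splitter
codes = huffman_codes(Counter(words), n_symbols=3)
for word, code in sorted(codes.items(), key=lambda kv: (len(kv[1]), kv[0])):
    print(word, code)
```

With three symbols, the most frequent word ("the") receives a single symbol and the six remaining words receive two each. The symbols carry no information about the characters or morphemes of a word, which is what separates the frequency effect from compositionality. Unknown words are not handled in this sketch.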

Paper Structure (12 sections, 2 figures, 5 tables, 1 algorithm)

Figures (2)

  • Figure 1: Ternary Huffman tree illustrating our approach. The tree is built with Algorithm 1 from word frequencies, shown as indices in the mapping (right), based on the following text: "the house is on the hill, the house is blue, the sky is blue."
  • Figure 2: Histograms of the number of tokens from the CS data that are segmented into 1, 2, or more symbols, for Huffman coding (left) vs. BPE (right). Six different vocabulary sizes are shown for Huffman coding (from 1k to 32k symbols) and five for BPE (from 2k to 32k merges). While Huffman coding uses at most 4 symbols per token, BPE may use up to 10 subwords.