Distributional Properties of Subword Regularization

Marco Cognetta, Vilém Zouhar, Naoaki Okazaki

TL;DR

This work analyzes the distributions induced by stochastic subword tokenizers (BPE-Dropout and MaxMatch-Dropout) and finds they are heavily biased toward a small set of tokenizations per word. It proposes Uniform Sampling of tokenizations, implemented via a finite-state transducer and acyclic lattice, as a drop-in replacement for dropout-based tokenization. Across English–German, English–Romanian, and English–French translation tasks, Uniform Sampling consistently improves translation quality (BLEU, chrF, COMET) compared to biased dropout variants, suggesting that unbiased tokenization distributions enhance subword regularization. The study highlights the potential for increased regularization and data augmentation through uniform tokenization sampling and calls for future work on achieving global uniformity and understanding entropy-related effects on learning.

Abstract

Subword regularization, used widely in NLP, improves model performance by reducing the dependency on exact tokenizations, augmenting the training corpus, and exposing the model to more unique contexts during training. BPE and MaxMatch, two popular subword tokenization schemes, have stochastic dropout regularization variants. However, there has not been an analysis of the distributions formed by them. We show that these stochastic variants are heavily biased towards a small set of tokenizations per word. If the benefits of subword regularization are as mentioned, we hypothesize that biasedness artificially limits the effectiveness of these schemes. Thus, we propose an algorithm to uniformly sample tokenizations that we use as a drop-in replacement for the stochastic aspects of existing tokenizers, and find that it improves machine translation quality.
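The idea of sampling tokenizations uniformly can be illustrated with a small dynamic program over the segmentation lattice of a single word. This is only a minimal sketch: it counts paths directly on string positions rather than building the finite-state transducer composition the paper describes, and the names `count_suffix_paths`, `sample_uniform_tokenization`, and `toy_vocab` are hypothetical, chosen for illustration.

```python
import random


def count_suffix_paths(word, vocab):
    """paths[i] = number of ways to split word[i:] into tokens from vocab."""
    n = len(word)
    paths = [0] * (n + 1)
    paths[n] = 1  # the empty suffix has exactly one (empty) segmentation
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n + 1):
            if word[i:j] in vocab:
                paths[i] += paths[j]
    return paths


def sample_uniform_tokenization(word, vocab, rng=random):
    """Draw one tokenization of `word` uniformly from all vocab segmentations."""
    paths = count_suffix_paths(word, vocab)
    if paths[0] == 0:
        raise ValueError(f"{word!r} cannot be segmented with this vocabulary")
    tokens, i = [], 0
    while i < len(word):
        # Choosing the next boundary j with probability paths[j] / paths[i]
        # makes every complete path through the lattice equally likely.
        ends = [j for j in range(i + 1, len(word) + 1) if word[i:j] in vocab]
        j = rng.choices(ends, weights=[paths[e] for e in ends], k=1)[0]
        tokens.append(word[i:j])
        i = j
    return tokens


# Hypothetical toy vocabulary, not taken from the paper's experiments.
toy_vocab = {"a", "b", "c", "ab", "bb", "bc"}
print(count_suffix_paths("abbc", toy_vocab)[0])  # number of tokenizations
print(sample_uniform_tokenization("abbc", toy_vocab, random.Random(0)))
```

Weighting each step by the number of completions is what keeps the sample uniform over whole tokenizations; picking each next token uniformly at random would instead favor tokenizations that pass through positions with few outgoing arcs.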

Paper Structure

This paper contains 16 sections, 4 theorems, 9 figures, 7 tables.

Key Result

Lemma 3.0

Let $\mathcal{B} = (\mathcal{V}, \mu)$ be a BPE tokenizer such that there exist $(a, b), (b, b), (b, c) \in \mu$ with $(a, b) >_{\mu} (b, b) >_{\mu} (b, c)$ and $abb, bbc, abbc \notin \mathcal{V}$. Then, there exists a word $w \in \Sigma^+$ for which the distribution of the dropout tokenizer ...
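A merge table of the shape used in the lemma can be explored empirically. The sketch below assumes one common formulation of BPE-Dropout (at every merge step, each applicable merge is independently skipped with probability $p$, and the highest-priority surviving merge is applied); the paper's own inference procedure may differ in detail. The function name `bpe_dropout`, the word `abbc`, and the rate `0.1` are illustrative choices, not values from the paper.

```python
import random
from collections import Counter


def bpe_dropout(word, merges, p, rng=random):
    """Tokenize `word` with BPE, dropping each candidate merge with probability p."""
    rank = {pair: r for r, pair in enumerate(merges)}  # lower rank = higher priority
    tokens = list(word)
    while True:
        # Each applicable merge survives this round with probability 1 - p.
        survivors = [
            (rank[pair], i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
            if pair in rank and rng.random() >= p
        ]
        if not survivors:
            return tokens
        _, i = min(survivors)  # highest-priority merge, leftmost on ties
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]


# Merge priorities (a, b) > (b, b) > (b, c); abb, bbc, abbc are not mergeable.
merges = [("a", "b"), ("b", "b"), ("b", "c")]
rng = random.Random(0)
counts = Counter(tuple(bpe_dropout("abbc", merges, 0.1, rng)) for _ in range(10_000))
for tokenization, n in counts.most_common():
    print(tokenization, n / 10_000)
```

Printing the empirical frequencies makes it easy to see how unevenly the probability mass is spread across the tokenizations of one word at a given dropout rate.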

Figures (9)

  • Figure 1: Uniformly sampling tokenizations from $\mathcal{A} \circ \mathcal{T}$.
  • Figure 2: The number of unique, observed tokenizations of a word with $N$ samples and dropout $p$.
  • Figure 3: Distribution uniformity measured by Shannon Efficiency (higher=more uniform; excludes the canonical form, which usually takes up most of the probability mass). Our Uniform Sampling versions (both for BPE and MaxMatch) guarantee balanced sampling of tokenizations, which is not true for the standard Dropout versions whose balance depends non-linearly on the dropout rate $p$.
  • Figure : BPE Inference (with dropout)
  • Figure : MaxMatch Inference (with dropout)
  • ...and 4 more figures
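
The uniformity measure named in the Figure 3 caption can be sketched as normalized entropy: the Shannon entropy of the observed tokenization frequencies divided by its maximum possible value, $\log k$ for $k$ distinct outcomes. The code below is an illustration under assumptions: normalizing by the number of observed outcomes (rather than all possible tokenizations) and returning 1.0 for a single outcome are choices made here, and `shannon_efficiency` and its `exclude` parameter are hypothetical names. The `exclude` argument mirrors the caption's note that the canonical form is left out.

```python
import math
from collections import Counter


def shannon_efficiency(samples, exclude=None):
    """Entropy of the empirical distribution divided by log(#distinct outcomes)."""
    counts = Counter(s for s in samples if s != exclude)
    total = sum(counts.values())
    if len(counts) <= 1:
        return 1.0  # convention chosen here: a single outcome is trivially uniform
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


# Tokenizations are stored as tuples so they can be counted.
skewed = [("ab", "bc")] * 8 + [("a", "b", "bc")] + [("ab", "b", "c")]
print(shannon_efficiency(skewed))
print(shannon_efficiency(skewed, exclude=("ab", "bc")))
```

A value near 1 means the sampled tokenizations are close to evenly balanced; values near 0 mean most samples collapse onto a few tokenizations.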

Theorems & Definitions (6)

  • Lemma 3.0
  • Lemma 3.0
  • Lemma C.0
  • proof
  • Lemma C.0
  • proof