Distributional Properties of Subword Regularization
Marco Cognetta, Vilém Zouhar, Naoaki Okazaki
TL;DR
This work analyzes the distributions induced by stochastic subword tokenizers (BPE-Dropout and MaxMatch-Dropout) and finds they are heavily biased toward a small set of tokenizations per word. It proposes Uniform Sampling of tokenizations, implemented via a finite-state transducer and acyclic lattice, as a drop-in replacement for dropout-based tokenization. Across English–German, English–Romanian, and English–French translation tasks, Uniform Sampling consistently improves translation quality (BLEU, chrF, COMET) compared to biased dropout variants, suggesting that unbiased tokenization distributions enhance subword regularization. The study highlights the potential for increased regularization and data augmentation through uniform tokenization sampling and calls for future work on achieving global uniformity and understanding entropy-related effects on learning.
Abstract
Subword regularization, used widely in NLP, improves model performance by reducing the dependency on exact tokenizations, augmenting the training corpus, and exposing the model to more unique contexts during training. BPE and MaxMatch, two popular subword tokenization schemes, have stochastic dropout regularization variants. However, the distributions that these variants induce over tokenizations have not been analyzed. We show that these stochastic variants are heavily biased towards a small set of tokenizations per word. If the benefits of subword regularization are as described above, we hypothesize that this bias artificially limits the effectiveness of these schemes. Thus, we propose an algorithm to uniformly sample tokenizations that we use as a drop-in replacement for the stochastic aspects of existing tokenizers, and find that it improves machine translation quality.
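To make the idea of uniformly sampling tokenizations concrete, the following is a minimal sketch and not the authors' implementation. It assumes a plain dynamic-programming path count over the implicit segmentation lattice of a single word, rather than the finite-state transducer construction mentioned in the TL;DR. The function names and the toy vocabulary are hypothetical and chosen only for illustration.

```python
import random


def count_suffix_paths(word, vocab):
    """Return paths, where paths[i] is the number of ways to segment
    word[i:] into tokens from vocab (paths[len(word)] == 1)."""
    n = len(word)
    paths = [0] * (n + 1)
    paths[n] = 1  # the empty suffix has exactly one (empty) segmentation
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n + 1):
            if word[i:j] in vocab:
                paths[i] += paths[j]
    return paths


def sample_uniform_tokenization(word, vocab, rng=random):
    """Draw one segmentation of word uniformly at random from all
    segmentations whose tokens are in vocab."""
    paths = count_suffix_paths(word, vocab)
    if paths[0] == 0:
        raise ValueError("word cannot be segmented with this vocabulary")
    n = len(word)
    tokens = []
    i = 0
    while i < n:
        # Candidate token ends that still lead to a complete segmentation.
        ends = [j for j in range(i + 1, n + 1)
                if word[i:j] in vocab and paths[j] > 0]
        # Choosing end j with probability paths[j] / paths[i] makes the
        # product over the whole walk telescope to 1 / paths[0].
        weights = [paths[j] for j in ends]
        j = rng.choices(ends, weights=weights, k=1)[0]
        tokens.append(word[i:j])
        i = j
    return tokens


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"t", "o", "k", "e", "n", "to", "tok", "en", "ken", "token"}
```

Under these assumptions, calling `sample_uniform_tokenization("token", TOY_VOCAB)` returns each of the word's valid segmentations with equal probability. This contrasts with the dropout-based schemes, which the abstract reports concentrate most of their probability mass on a small set of tokenizations per word.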
