Self-Vocabularizing Training for Neural Machine Translation
Pin-Jie Lin, Ernie Chang, Yangyang Shi, Vikas Chandra
TL;DR
The paper investigates why standard BPE vocabularies may be suboptimal for neural machine translation. It exposes a discrepancy between the original vocabulary $V_0$ and the self-induced vocabulary $V_1$ that emerges during self-training, with $|V_1| \approx 0.8\,|V_0|$ on IWSLT14 DE-EN. It introduces Iterative Self-Vocabularization, a loop in which a model is trained on data segmented by the current vocabulary, generates pseudo-labels, derives a new vocabulary from those labels, and is retrained with the updated vocabulary until improvements plateau. The authors measure vocabulary shifts with entropy-based metrics, showing that corpus entropy decreases and vocabulary overlap shrinks as iterations proceed while BLEU improves, by up to 1.3 BLEU after one iteration and more with subsequent iterations. They also show that deeper encoder architectures tend to reduce vocabulary overlap and increase token uniqueness while achieving 6–8% vocabulary compression, indicating practical gains in efficiency and translation quality. Overall, the work suggests rethinking vocabulary induction during training and presents a principled, entropy-aware self-training approach that yields more compact, effective vocabularies for NMT.
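The loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `self_vocabularize`, its callable parameters (`train`, `induce_vocab`, `score`), and the toy stand-ins are all hypothetical, and the sketch assumes that each retraining round reuses the original parallel data with only the vocabulary changed.

```python
from typing import Callable, List, Set, Tuple

Pairs = List[Tuple[str, str]]
Model = Callable[[str], str]


def self_vocabularize(
    sources: List[str],
    targets: List[str],
    vocab: Set[str],
    train: Callable[[Pairs, Set[str]], Model],
    induce_vocab: Callable[[List[str], Set[str]], Set[str]],
    score: Callable[[Model], float],
    max_iters: int = 5,
    min_gain: float = 0.0,
) -> Tuple[Set[str], List[float]]:
    """Hypothetical driver: train, pseudo-label, re-derive vocabulary, repeat."""
    history: List[float] = []
    best_vocab = vocab
    for _ in range(max_iters):
        # Train on data segmented with the current vocabulary (assumed step).
        model = train(list(zip(sources, targets)), vocab)
        quality = score(model)
        # Stop once the quality metric (e.g. BLEU) no longer improves.
        if history and quality - history[-1] <= min_gain:
            break
        history.append(quality)
        best_vocab = vocab
        # Pseudo-labels: the model's own predictions for the source side.
        pseudo_labels = [model(src) for src in sources]
        # New vocabulary derived from the tokens the model actually produced.
        vocab = induce_vocab(pseudo_labels, vocab)
    return best_vocab, history


# Toy stand-ins so the sketch runs end to end; none of these are real NMT code.
_toy_scores = iter([30.0, 31.3, 31.2])


def toy_train(pairs: Pairs, vocab: Set[str]) -> Model:
    return lambda src: src  # identity "translation"


def toy_induce(pseudo_labels: List[str], vocab: Set[str]) -> Set[str]:
    used = {tok for line in pseudo_labels for tok in line.split()}
    return vocab & used  # keep only the tokens that appear in the predictions


def toy_score(model: Model) -> float:
    return next(_toy_scores)


final_vocab, bleu_history = self_vocabularize(
    ["a b", "b c"], ["x", "y"], {"a", "b", "c", "d"},
    toy_train, toy_induce, toy_score,
)
```

In the toy run, the unused token `d` is dropped after the first round and the loop stops when the third score fails to improve on the second. A real setup would replace the stand-ins with an actual NMT trainer, a BPE learner, and a BLEU scorer.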
Abstract
Past vocabulary learning techniques identify relevant vocabulary before training, relying on statistical and entropy-based assumptions that largely neglect the role of model training. Empirically, we observe that trained translation models are induced to use a byte-pair encoding (BPE) vocabulary subset distinct from the original BPE vocabulary, and that retraining with this induced vocabulary improves performance. In this paper, we analyze this discrepancy in neural machine translation by examining vocabulary and entropy shifts during self-training, where each iteration generates a labeled dataset by pairing source sentences with the model's predictions to define a new vocabulary. Building on these insights, we propose self-vocabularizing training, an iterative method that self-selects a smaller, better-suited vocabulary, yielding an improvement of up to 1.49 BLEU. Moreover, we find that deeper model architectures lead to both an increase in unique token usage and a 6–8% reduction in vocabulary size.
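To make the vocabulary and entropy shifts concrete, the sketch below computes two simple quantities from segmented text. The definitions are assumptions chosen for illustration, not necessarily the metrics used in the paper: `token_entropy` is the Shannon entropy of the unigram token distribution, and `overlap_ratio` is the fraction of the original vocabulary $V_0$ that survives in the induced vocabulary $V_1$. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Set


def induced_vocab(segmented: List[List[str]]) -> Set[str]:
    """Set of subword tokens that actually occur in the segmented text."""
    return {tok for sent in segmented for tok in sent}


def token_entropy(segmented: List[List[str]]) -> float:
    """Shannon entropy (in bits) of the unigram token distribution."""
    counts = Counter(tok for sent in segmented for tok in sent)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def overlap_ratio(v0: Set[str], v1: Set[str]) -> float:
    """Fraction of the original vocabulary v0 that is also in v1."""
    return len(v0 & v1) / len(v0)


# Toy example: the original vocabulary has five tokens, but the (pretend)
# model predictions only ever use four of them.
v0 = {"a", "b", "c", "d", "e"}
predictions = [["a", "b"], ["c", "d"], ["a", "b"], ["c", "d"]]
v1 = induced_vocab(predictions)
ratio = overlap_ratio(v0, v1)
entropy_bits = token_entropy(predictions)
```

Here four tokens occur equally often, so the entropy is 2 bits and the overlap ratio is 0.8, which mirrors the kind of $V_0$ versus $V_1$ gap the abstract describes, though on invented data.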
