Self-Vocabularizing Training for Neural Machine Translation

Pin-Jie Lin, Ernie Chang, Yangyang Shi, Vikas Chandra

TL;DR

The paper investigates why standard BPE vocabularies may be suboptimal for neural machine translation. It exposes a discrepancy between the original vocabulary $V_0$ and a self-induced vocabulary $V_1$ that emerges during self-training, with $|V_1| \approx 0.8\,|V_0|$ on IWSLT14 DE-EN. It introduces Iterative Self-Vocabularization, a loop in which a model is trained on data segmented by the current vocabulary, generates pseudo-labels, derives a new vocabulary from those labels, and is retrained with the updated vocabulary until improvements plateau. The authors track vocabulary shifts with entropy-based metrics, showing that corpus entropy decreases and vocabulary overlap shrinks as iterations proceed while BLEU improves, by up to 1.3 BLEU after one iteration and by more in subsequent iterations. They also show that deeper encoder architectures tend to reduce vocabulary overlap and increase token uniqueness while achieving 6–8% vocabulary compression, which points to practical gains in efficiency and translation quality. Overall, the work suggests rethinking vocabulary induction during training and presents an entropy-aware self-training approach that yields more compact, effective vocabularies for NMT.
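
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `segment` is a greedy longest-match stand-in for applying a BPE vocabulary, `train` and `evaluate` are hypothetical callables supplied by the caller (for example an NMT trainer and a BLEU scorer), and the next vocabulary is taken to be the set of tokens appearing in the pseudo-labeled pairs, which may differ from how the paper derives it.

```python
from typing import Callable, List, Sequence, Set, Tuple

Tokens = List[str]
Model = Callable[[Tokens], Tokens]  # maps source tokens to predicted target tokens


def segment(text: str, vocab: Set[str]) -> Tokens:
    """Greedy longest-match segmentation (a stand-in for applying BPE).

    Characters not covered by the vocabulary become single-character tokens.
    """
    tokens: Tokens = []
    for word in text.split():
        i = 0
        while i < len(word):
            for j in range(len(word), i, -1):
                piece = word[i:j]
                if piece in vocab or j == i + 1:
                    tokens.append(piece)
                    i = j
                    break
    return tokens


def self_vocabularize(
    d0: Sequence[Tuple[str, str]],
    v0: Set[str],
    train: Callable[[List[Tuple[Tokens, Tokens]]], Model],  # hypothetical trainer
    evaluate: Callable[[Model], float],                     # hypothetical scorer
    max_iters: int = 5,
) -> Tuple[Set[str], float]:
    """Iterate segment -> train -> pseudo-label -> re-derive vocabulary."""
    vocab = set(v0)
    best_vocab, best_score = set(v0), float("-inf")
    for _ in range(max_iters):
        # D_t: the original data D_0 segmented with the current vocabulary V_t.
        d_t = [(segment(src, vocab), segment(tgt, vocab)) for src, tgt in d0]
        model = train(d_t)          # M_t
        score = evaluate(model)
        if score <= best_score:     # stop once the score no longer improves
            break
        best_vocab, best_score = set(vocab), score
        # D': source sentences paired with the model's own predictions.
        pseudo = [(src, model(src)) for src, _ in d_t]
        # V_{t+1}: tokens actually used in D' (an assumption of this sketch).
        vocab = {tok for src, pred in pseudo for tok in src + pred}
    return best_vocab, best_score
```

Passing `train` and `evaluate` as callables keeps the sketch independent of any particular NMT toolkit; the function returns the vocabulary that produced the best score seen before the stopping condition was hit.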

Abstract

Past vocabulary learning techniques identify relevant vocabulary before training, relying on statistical and entropy-based assumptions that largely neglect the role of model training. Empirically, we observe that trained translation models are induced to use a byte-pair encoding (BPE) vocabulary subset distinct from the original BPE vocabulary, leading to performance improvements when retrained with the induced vocabulary. In this paper, we analyze this discrepancy in neural machine translation by examining vocabulary and entropy shifts during self-training, where each iteration generates a labeled dataset by pairing source sentences with the model's predictions to define a new vocabulary. Building on these insights, we propose self-vocabularizing training, an iterative method that self-selects a smaller, more optimal vocabulary, yielding up to a 1.49 BLEU improvement. Moreover, we find that deeper model architectures lead to both an increase in unique token usage and a 6-8% reduction in vocabulary size.

Paper Structure

This paper contains 15 sections, 3 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Illustration of self-vocabularizing training: At each iteration, the original dataset $D_0$ is segmented using vocabulary $V_t$ to form the training set $D_t$. $D_t$ is then used to train model $M_t$, which generates a pseudo dataset $D'$. A new vocabulary set $V_{t+1}$ is derived from $D'$, completing the training loop. This process repeats until no further improvements are observed.
  • Figure 2: Entropy and performance across self-vocabularizing training iterations. (Left) BLEU score (blue) consistently improves across iterations. Meanwhile, the self-learned vocabulary reduces corpus entropy (teal), indicating a better estimation of token distribution. (Right) Vocabulary shift measured by vocabulary overlap (orange) between consecutive vocabularies $V_t$ and $V_{t-1}$, showing that the model initially selects a broad set of subwords before consolidating onto a subset of $V_{t-1}$. The type-token ratio (TTR) (purple) reflects the diversity of learned semantic units, reported on the training corpus scaled by $1000$.
  • Figure 3: Performance and vocabulary overlap across models with different encoder and decoder depths. (Left) As the number of encoder (teal, solid line) or decoder (teal, dashed line) layers increases, BLEU scores consistently improve. However, vocabulary overlap decreases for deeper encoder (blue, solid line) or decoder (blue, dashed line) layers, indicating that deeper models tend to use more unique tokens. (Right) Vocabulary compression (VC) across models with varying depths. All models trained with self-vocabularizing training effectively compress the token set. Notably, deeper encoder models (purple) exhibit a smoother reduction in VC rates, whereas deeper decoder models (orange) require more tokens for inference. VC is reported on the test set using models of different depths in either the encoder or decoder, with a single round of self-vocabularizing training.
  • Figure 4: Impact of self-vocabularizing training on IWSLT14 EN-DE. (Left) BLEU scores improve consistently across iterations, while corpus entropy decreases, indicating more stable and predictable token distributions. (Right) Vocabulary overlap reduces as the model gradually refines its subword selection, while the type-token ratio (TTR) reflects evolving semantic diversity.
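
The quantities plotted in Figures 2 to 4 can be approximated with a few short functions. The sketch below rests on assumptions, since this page does not reproduce the paper's equations: corpus entropy is taken as the Shannon entropy of the unigram token distribution, vocabulary overlap as the Jaccard similarity of two vocabularies, TTR as distinct tokens over total tokens, and vocabulary compression as the relative reduction in vocabulary size. The paper's exact definitions and normalizations may differ.

```python
import math
from collections import Counter
from typing import Iterable, List, Set


def corpus_entropy(corpus: Iterable[List[str]]) -> float:
    """Shannon entropy (in bits) of the unigram token distribution (assumed definition)."""
    counts = Counter(tok for sent in corpus for tok in sent)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def vocabulary_overlap(v_t: Set[str], v_prev: Set[str]) -> float:
    """Jaccard similarity between consecutive vocabularies (assumed normalization)."""
    union = v_t | v_prev
    return len(v_t & v_prev) / len(union) if union else 1.0


def type_token_ratio(corpus: Iterable[List[str]]) -> float:
    """Number of distinct tokens divided by the total number of tokens."""
    tokens = [tok for sent in corpus for tok in sent]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def vocabulary_compression(v_induced: Set[str], v_original: Set[str]) -> float:
    """Relative reduction in vocabulary size, e.g. 0.08 for an 8% smaller vocabulary (assumed)."""
    return 1.0 - len(v_induced) / len(v_original)
```

To match the scaling mentioned in the Figure 2 caption, the TTR value would be multiplied by 1000 before plotting.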