On the Effect of (Near) Duplicate Subwords in Language Modelling

Anton Schäfer, Thomas Hofmann, Imanol Schlag, Tiago Pimentel

TL;DR

The paper investigates how (near) duplicate subwords in tokenisation affect language-model training efficiency. It introduces a controlled perfect-duplication setup to upper-bound the gain available from perfectly generalising across duplicates, and compares it with naturally occurring near duplicates by merging (deduplicating) real subwords. Using information-theoretic constructs, it shows that perfect duplicates are largely interchangeable in theory, yet in practice models trained on a fully duplicated vocabulary need roughly 17% more data. Natural near duplicates, by contrast, are not interchangeable, and merging them considerably hurts performance because they retain semantic differences. The results suggest that better cross-duplicate generalisation (e.g., in character-level models) could yield data-efficiency gains only under ideal conditions; in real vocabularies, near duplicates are less similar than anticipated, which limits such improvements.

Abstract

Tokenisation is a core part of language models (LMs). It involves splitting a character sequence into subwords, which are assigned arbitrary indices before being served to the LM. While typically lossless, this process may lead to less sample-efficient LM training: because it removes character-level information, it could make it harder for LMs to generalise across similar subwords, such as now and Now. We refer to such subwords as near duplicates. In this paper, we study the impact of near duplicate subwords on LM training efficiency. First, we design an experiment that gives us an upper bound on how much we should expect a model to improve if we could perfectly generalise across near duplicates. We do this by duplicating each subword in our LM's vocabulary, creating perfectly equivalent classes of subwords. Experimentally, we find that LMs need roughly 17% more data when trained in a fully duplicated setting. Second, we investigate the impact of naturally occurring near duplicates on LMs. Here, we see that merging them considerably hurts LM performance. Therefore, although subword duplication negatively impacts LM training efficiency, naturally occurring near duplicates may not be as similar as anticipated, limiting the potential for performance improvements.
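
The perfect-duplication setup described above can be illustrated with a minimal sketch. It assumes that each occurrence of a subword is independently replaced by one of its two copies with equal probability; the abstract does not specify the sampling scheme, and the function names `duplicate_ids` and `deduplicate_ids` are hypothetical.

```python
import random


def duplicate_ids(ids, vocab_size, rng):
    """Map each subword id w to w or w + vocab_size.

    Assumption: the copy is chosen uniformly at random per occurrence.
    The duplicated vocabulary therefore has 2 * vocab_size entries.
    """
    return [w + vocab_size * rng.randrange(2) for w in ids]


def deduplicate_ids(ids, vocab_size):
    """Deterministic map from duplicated ids back to the original ids."""
    return [w % vocab_size for w in ids]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
vocab_size = 5
original = [0, 3, 3, 1, 4, 2]

duplicated = duplicate_ids(original, vocab_size, rng)
restored = deduplicate_ids(duplicated, vocab_size)
```

Because the deduplication map is deterministic, `restored` always equals `original`: duplication adds no information about the text, only arbitrary index noise that the model has to learn to ignore.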

Paper Structure (28 sections, 1 theorem, 22 equations, 8 figures, 7 tables)

Key Result

Lemma 1

Let ${\color{MidnightBlue}\mathbb{S}}: {\color{purple}\Sigma} \to {\color{MidnightBlue}\overline{\Sigma}}$ be a deterministic function which maps duplicated subwords ${\color{purple}w} \in {\color{purple}\Sigma}$ to their deduplicated versions ${\color{MidnightBlue}c} \in {\color{MidnightBlue}\overline{\Sigma}}$ ...

Figures (8)

  • Figure 1: Left: Fitted power laws capturing the relationship between training data and $\mathrm{PPL}_{\color{MidnightBlue}\mathbb{S}}$. Our standard training set contains around $1.2$B tokens. Right: Data required to achieve the same performance with $p({\color{MidnightBlue}\boldsymbol{{\color{MidnightBlue}c}}})$ and $p({\color{purple}\boldsymbol{w}})$, computed based on the fitted scaling law curves. In the considered interval, this curve's slope---which roughly corresponds to $\frac{\text{number of training tokens for $p({\color{purple}\boldsymbol{w}})$}}{\text{number of training tokens for $p({\color{MidnightBlue}\boldsymbol{{\color{MidnightBlue}c}}})$}}$---is approximately equal to $\frac{1}{0.85}$.
  • Figure 2: Impact of duplication on $\mathrm{PPL}_{\color{MidnightBlue}\mathbb{S}}$ while varying the fraction of subwords in the vocabulary that are duplicated ($1.0$ corresponds to $\widehat{p}({\color{purple}\boldsymbol{w}})$, $0.0$ to $\widehat{p}({\color{MidnightBlue}\boldsymbol{{\color{MidnightBlue}c}}})$). Lower $\mathrm{PPL}_{\color{MidnightBlue}\mathbb{S}}$ is better. When duplicating 70% of the vocabulary (which yields a 41% duplication rate in the final vocabulary, roughly the rate of near duplicates in real vocabularies), we obtain $\mathrm{PPL}_{\color{MidnightBlue}\mathbb{S}} \approx 22.4$; this is equivalent to a $\approx 10\%$ decrease in data efficiency.
  • Figure 3: Input embedding cosine similarity of duplicates ${\color{purple}w}$, ${\color{purple}w}'$, by frequency. Frequencies binned and similarities averaged per bin.
  • Figure 4: Duplication of half of the vocabulary: Analysing the mean surprisal difference per subword between $\widehat{p}({\color{purple}\boldsymbol{w}})$ and $\widehat{p}({\color{MidnightBlue}\boldsymbol{{\color{MidnightBlue}c}}})$. Frequencies are categorised into bins, with averages computed for each bin.
  • Figure 5: Duplication of half of the vocabulary. Difference between the surprisal assigned to each token by $\widehat{p}({\color{purple}\boldsymbol{w}})$ and $\widehat{p}({\color{MidnightBlue}\boldsymbol{{\color{MidnightBlue}c}}})$, depending on fraction $\frac{\text{duplicated subwords}}{\text{non-duplicated subwords}}$ in context. Fractions are categorised into bins, with average surprisal differences computed for each bin. "Support" shows the number of samples per bin.
  • ...and 3 more figures
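
The two ratios quoted in the captions of Figures 1 and 2 can be checked with a short calculation. This is a sketch under stated assumptions: the original vocabulary has $V$ subwords, each duplicated subword contributes exactly one extra entry, and the "duplication rate" is read as the share of added entries in the final vocabulary (the reading that reproduces the quoted 41%).

```latex
\[
\underbrace{\frac{0.7\,V}{V + 0.7\,V}}_{\text{duplication rate (Figure 2)}}
  = \frac{0.7}{1.7} \approx 0.41,
\qquad
\underbrace{\frac{1}{0.85}}_{\text{data ratio (Figure 1)}}
  \approx 1.176 .
\]
```

The second value is consistent with the abstract's statement that LMs need roughly 17% more data in the fully duplicated setting.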

Theorems & Definitions (3)

  • Definition 1: Time-dependent Conditional Entropy and Mutual Information
  • Lemma 1
  • proof