A Formal Perspective on Byte-Pair Encoding

Vilém Zouhar, Clara Meister, Juan Luis Gastaldi, Li Du, Tim Vieira, Mrinmaya Sachan, Ryan Cotterell

TL;DR

The paper addresses the lack of formal guarantees for Byte-Pair Encoding by recasting BPE training as a constrained submodular maximization over valid merge sequences, with the objective defined as the compression utility $\kappa_{\boldsymbol{x}}(\boldsymbol{\mu}) = |\boldsymbol{x}| - |\textsc{Apply}_{\boldsymbol{\mu}}(\boldsymbol{x})|$, the number of symbols saved by applying the merge sequence $\boldsymbol{\mu}$ to the string $\boldsymbol{x}$. It proves that the greedy BPE algorithm is a $\frac{1}{\sigma(\boldsymbol{\mu}^\star)}\left(1 - e^{-\sigma(\boldsymbol{\mu}^\star)}\right)$-approximation of the optimal merge sequence $\boldsymbol{\mu}^\star$, where $\sigma(\boldsymbol{\mu}^\star)$ is the total backward curvature. It also introduces a faster implementation that reduces the runtime from $\mathcal{O}(N M)$ to $\mathcal{O}(N \log M)$, and an exact solver that speeds up brute-force search with memoization. The work additionally formalizes hierarchical submodularity to accommodate the merge-structure constraints and provides empirical estimates $\hat{\sigma}$ of the curvature on synthetic and natural-language data, which give an empirical lower bound of $\approx 0.37$ on the approximation ratio. Together, these results give BPE a theoretical grounding in NLP and give practitioners both a principled expectation of greedy BPE's performance and an exact algorithm for cases where optimality is required.

Abstract

Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a $\frac{1}{\sigma(\boldsymbol{\mu}^\star)}\left(1-e^{-\sigma(\boldsymbol{\mu}^\star)}\right)$-approximation of an optimal merge sequence, where $\sigma(\boldsymbol{\mu}^\star)$ is the total backward curvature with respect to the optimal merge sequence $\boldsymbol{\mu}^\star$. Empirically the lower bound of the approximation is $\approx 0.37$. We provide a faster implementation of BPE which improves the runtime complexity from $\mathcal{O}\left(N M\right)$ to $\mathcal{O}\left(N \log M\right)$, where $N$ is the sequence length and $M$ is the merge count. Finally, we optimize the brute-force algorithm for optimal BPE using memoization.
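
To get a feel for the approximation factor stated in the abstract, the expression $\frac{1}{\sigma}\left(1-e^{-\sigma}\right)$ can be evaluated numerically. The sketch below is only arithmetic on that formula; the function name `greedy_bound` and the sample values of $\sigma$ are illustrative assumptions, not taken from the paper.

```python
import math


def greedy_bound(sigma: float) -> float:
    """Evaluate (1/sigma) * (1 - exp(-sigma)) for a curvature value sigma.

    `greedy_bound` is a hypothetical helper name used for illustration.
    """
    if sigma < 0:
        raise ValueError("sigma is assumed to be non-negative here")
    if sigma == 0:
        return 1.0  # limit of the expression as sigma -> 0
    # -expm1(-sigma) equals 1 - exp(-sigma), computed stably for small sigma.
    return -math.expm1(-sigma) / sigma


# Illustrative sigma values (assumed, not reported results).
for sigma in (0.5, 1.0, 2.5):
    print(f"sigma = {sigma}: bound = {greedy_bound(sigma):.3f}")
```

The factor shrinks as $\sigma$ grows: at $\sigma = 1$ it equals $1 - 1/e \approx 0.632$, and a curvature near $2.5$ gives a value near the $\approx 0.37$ quoted above.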

Paper Structure

This paper contains 20 sections, 11 theorems, 19 equations, 5 figures, 3 tables, 3 algorithms.

Key Result

Proposition 3.2

Let $\kappa_{\boldsymbol{x}}$ be the compression utility function. Then, for a fixed $\boldsymbol{x} \in \Sigma^*$, $\kappa_{\boldsymbol{x}}(\cdot)$ is monotone.
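
The proposition can be illustrated on a small case. The sketch below assumes that monotonicity means extending a merge sequence never lowers the utility, and it encodes merges as nested Python tuples; both the reading of the definition and the helper names (`apply_merge`, `utility`) are assumptions made for illustration. The string and merge sequence are the ones from Figure 1.

```python
def apply_merge(tokens, merge):
    """Replace left-to-right, non-overlapping occurrences of the pair `merge`."""
    left, right = merge
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            out.append((left, right))  # the merged pair becomes one token
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def utility(x, merges):
    """Compression utility: |x| minus the length after applying `merges` in order."""
    tokens = list(x)
    for merge in merges:
        tokens = apply_merge(tokens, merge)
    return len(x) - len(tokens)


# Merge sequence from Figure 1, written as nested tuples.
ab, cb = ("a", "b"), ("c", "b")
aba = (ab, "a")
mu = [ab, cb, aba, (aba, cb)]
x = "abaabacbcb"

# Utility of every prefix of mu, from the empty sequence to the full one.
prefix_utilities = [utility(x, mu[:k]) for k in range(len(mu) + 1)]
print(prefix_utilities)  # [0, 2, 4, 6, 7]
```

Each longer prefix compresses at least as much as the shorter one, which is what one would expect because applying a merge can only shorten the token sequence or leave it unchanged.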

Figures (5)

  • Figure 1: Application of the merge sequence $\boldsymbol{\mu} = \langle [a,b], [c,b], [[a,b],a], [[[a,b],a],[c,b]]\rangle$ on the string $\boldsymbol{x} = abaabacbcb$. The result can be represented as an ordered forest. Each tree is associated with a subword in the text: $aba$, $abacb$, and $cb$.
  • Figure 2: Visualization of linked list representation of the string and the associated priority queue (frequency values in dashed boxes) with merges. Nodes in red will be removed in the next step, nodes in green were added in contrast to the previous step and nodes in purple were just added but will be removed. Black lines from queue to the string show which nodes to merge. Grey lines show which pairs in the priority queue will have reduced frequencies.
  • Figure 3: Comparison of runtimes for brute-force DFS and DFS with memoization. Values above 1 indicate that DFS+memoization is faster than DFS by that factor. Points show the average of runs on 5 different input strings (each 2 randomly sampled English sentences of 64 characters).
  • Figure : A minimal implementation of Sennrich et al.'s (2016) greedy algorithm for BPE in Python. See the next figure for a version with overlap-adjusted counts.
  • Figure : An implementation of Sennrich et al.'s (2016) greedy algorithm for BPE in Python with overlap-adjusted pair counts.
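
The example in Figure 1 and the greedy procedure mentioned in the last two captions can be sketched together in a few lines of Python. This is a minimal sketch, not the paper's code: the nested-tuple encoding of merges, the helper names, the naive pair counts (not overlap-adjusted), and the first-seen tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def apply_merge(tokens, merge):
    """Replace left-to-right, non-overlapping occurrences of the pair `merge`."""
    left, right = merge
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            out.append((left, right))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def surface(token):
    """Flatten a nested-pair token back into its character string."""
    if isinstance(token, str):
        return token
    return surface(token[0]) + surface(token[1])


def greedy_bpe(x, num_merges):
    """Repeatedly merge the most frequent adjacent pair (naive counts)."""
    tokens, merges = list(x), []
    for _ in range(num_merges):
        counts = Counter(zip(tokens, tokens[1:]))
        if not counts:
            break  # fewer than two tokens left
        # Assumption: ties go to the pair seen first in the sequence.
        best = max(counts, key=counts.get)
        merges.append(best)
        tokens = apply_merge(tokens, best)
    return merges, tokens


x = "abaabacbcb"

# Fixed merge sequence from Figure 1.
ab, cb = ("a", "b"), ("c", "b")
aba = (ab, "a")
fig1_tokens = list(x)
for merge in [ab, cb, aba, (aba, cb)]:
    fig1_tokens = apply_merge(fig1_tokens, merge)
fig1_surface = [surface(t) for t in fig1_tokens]
print(fig1_surface)  # ['aba', 'abacb', 'cb']

# Greedy selection of four merges on the same string.
greedy_merges, greedy_tokens = greedy_bpe(x, 4)
print([surface(t) for t in greedy_tokens])
print("utility:", len(x) - len(greedy_tokens))
```

Recounting all pairs after every merge, as this sketch does, takes time proportional to the sequence length per merge. That corresponds to the $\mathcal{O}(N M)$ behaviour the abstract describes as the baseline; the linked list and priority queue of Figure 2 are what the faster variant relies on.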

Theorems & Definitions (40)

  • Definition 2.1
  • Example 2.2
  • Definition 2.3
  • Definition 2.4
  • Definition 2.5
  • Definition 2.6
  • Definition 3.1
  • Proposition 3.2
  • proof
  • Definition 3.3
  • ...and 30 more