A Formal Perspective on Byte-Pair Encoding
Vilém Zouhar, Clara Meister, Juan Luis Gastaldi, Li Du, Tim Vieira, Mrinmaya Sachan, Ryan Cotterell
TL;DR
The paper addresses the lack of formal guarantees for Byte-Pair Encoding (BPE) by recasting BPE training as a constrained submodular maximization problem over valid merge sequences. The objective is the compression utility $\kappa_{\boldsymbol{x}}(\boldsymbol{\mu}) = |\boldsymbol{x}| - |\textsc{Apply}_{\boldsymbol{\mu}}(\boldsymbol{x})|$, i.e., the number of symbols saved by applying the merge sequence $\boldsymbol{\mu}$ to the input $\boldsymbol{x}$. The paper proves that greedy BPE is a $\frac{1}{\sigma(\boldsymbol{\mu}^\star)}\left(1 - e^{-\sigma(\boldsymbol{\mu}^\star)}\right)$-approximation of the optimal merge sequence $\boldsymbol{\mu}^\star$, where $\sigma(\boldsymbol{\mu}^\star)$ is the total backward curvature with respect to $\boldsymbol{\mu}^\star$. It also gives a faster implementation that reduces the runtime from $\mathcal{O}(N M)$ to $\mathcal{O}(N \log M)$ for sequence length $N$ and merge count $M$, and an exact brute-force solver sped up with memoization. The work additionally formalizes hierarchical submodularity to accommodate the merge-structure constraints and reports empirical estimates $\hat{\sigma}$ on synthetic and natural-language data, which put the empirical lower bound of the approximation at $\approx 0.37$. Together, these results give BPE a theoretical grounding in NLP and give practitioners both a scalable implementation and an exact algorithm for cases where optimality is required.
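To make the approximation factor concrete, the following minimal sketch evaluates $\frac{1}{\sigma}\left(1 - e^{-\sigma}\right)$ for a given curvature value. The function name `greedy_bound` and the sample inputs are illustrative assumptions, not taken from the paper; the value 2.5 is used only because it happens to produce a factor close to the $\approx 0.37$ figure quoted in the abstract.

```python
import math


def greedy_bound(sigma: float) -> float:
    """Evaluate (1 / sigma) * (1 - exp(-sigma)) for a curvature sigma > 0.

    `sigma` stands in for the total backward curvature; the name and the
    positivity check are assumptions made for this sketch.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    # -expm1(-sigma) equals 1 - exp(-sigma), computed stably for small sigma.
    return -math.expm1(-sigma) / sigma


if __name__ == "__main__":
    # Hypothetical curvature values chosen only for illustration.
    for sigma in (0.5, 1.0, 2.5):
        print(f"sigma={sigma:.1f}  bound={greedy_bound(sigma):.3f}")
```

The factor decreases as the curvature grows: at $\sigma = 1$ it is $1 - 1/e \approx 0.632$, and at $\sigma = 2.5$ it is about $0.367$.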
Abstract
Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a $\frac{1}{\sigma(\boldsymbol{\mu}^\star)}\left(1-e^{-\sigma(\boldsymbol{\mu}^\star)}\right)$-approximation of an optimal merge sequence, where $\sigma(\boldsymbol{\mu}^\star)$ is the total backward curvature with respect to the optimal merge sequence $\boldsymbol{\mu}^\star$. Empirically, the lower bound of the approximation is $\approx 0.37$. We provide a faster implementation of BPE which improves the runtime complexity from $\mathcal{O}\left(N M\right)$ to $\mathcal{O}\left(N \log M\right)$, where $N$ is the sequence length and $M$ is the merge count. Finally, we optimize the brute-force algorithm for optimal BPE using memoization.
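As an illustration of the greedy procedure the abstract refers to, here is a minimal sketch of textbook greedy BPE together with the compression utility (input length minus length after applying the merges). It is not the paper's implementation: the function names, the tie-breaking rule (first pair encountered), the overlapping pair counts, and the representation of merged tokens as concatenated strings are all simplifying assumptions. Each iteration rescans the whole sequence, so this corresponds to the slower $\mathcal{O}(N M)$ baseline, not the faster variant.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace occurrences of `pair` in `seq`, scanning left to right."""
    # Assumption: a merged token is the string concatenation of its parts.
    merged = pair[0] + pair[1]
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == pair[0] and seq[i + 1] == pair[1]:
            out.append(merged)
            i += 2  # skip both merged symbols
        else:
            out.append(seq[i])
            i += 1
    return out


def greedy_bpe(text, num_merges):
    """Pick the most frequent adjacent pair `num_merges` times (greedy)."""
    seq = list(text)
    merges = []
    for _ in range(num_merges):
        counts = Counter(zip(seq, seq[1:]))  # overlapping pair counts
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        # Ties go to the pair seen first (Counter keeps insertion order).
        pair = max(counts, key=counts.get)
        merges.append(pair)
        seq = apply_merge(seq, pair)
    return merges, seq


def compression_utility(text, merges):
    """Number of symbols saved by applying `merges` to `text` in order."""
    seq = list(text)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return len(text) - len(seq)


if __name__ == "__main__":
    merges, seq = greedy_bpe("abababab", 2)
    print(merges)                                   # [('a', 'b'), ('ab', 'ab')]
    print(seq)                                      # ['abab', 'abab']
    print(compression_utility("abababab", merges))  # 6
```

On the toy string `abababab`, two greedy merges shrink eight symbols to two tokens, so the utility is 6.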
