Tokenisation over Bounded Alphabets is Hard

Violeta Kastreva, Philip Whittington, Dennis Komm, Tiago Pimentel

TL;DR

This work analyses the computational complexity of tokenisation when the alphabet is bounded, addressing two natural variants: direct and bottom-up tokenisation. It proves strong hardness results: binary direct and binary bottom-up tokenisation are NP-hard to decide and admit no polynomial-time approximation scheme (PTAS) unless $\mathrm{P}=\mathrm{NP}$, with explicit gap-hardness bounds that rule out arbitrarily close approximations. Unary direct tokenisation is strongly NP-complete, showing hardness even over the simplest alphabet, while a unary bottom-up variant is at least weakly NP-hard. The hardness results draw on reductions from 3-Occur MAX2SAT and related constructions, and extend to gap problems that imply inapproximability unless $\mathrm{P}=\mathrm{NP}$. These findings explain why practical tokenisers like BPE and UnigramLM rely on heuristics, and they motivate the search for provably good approximation algorithms or relaxations. Overall, the paper establishes that tokenisation hardness is intrinsic and persists even under tightly bounded alphabets, guiding future research toward approximation and tractable relaxations rather than exact optimisation.

Abstract

Recent works have shown that tokenisation is NP-complete. However, these works assume tokenisation is applied to inputs with unboundedly large alphabets -- an unrealistic assumption, given that in practice tokenisers operate over fixed-size alphabets, such as bytes or Unicode characters. We close this gap by analysing tokenisation over bounded $n$-ary alphabets, considering two natural variants: bottom-up tokenisation and direct tokenisation, where we must, respectively, select a sequence of merge operations or a vocabulary whose application optimally compresses a dataset. First, we note that proving hardness results for an $n$-ary alphabet proves the same results for alphabets of any larger size. We then prove that even with binary alphabets, both variants are not only NP-complete, but admit no polynomial-time approximation scheme (unless P=NP). We further show that direct tokenisation remains NP-complete even when applied to unary alphabets. While unary alphabets may not be practically useful, this result establishes that the computational intractability of tokenisation is not an artifact of large alphabets or complex constructions, but a fundamental barrier. Overall, our results explain why practical algorithms such as BPE and UnigramLM are heuristic, and point toward approximation algorithms being an important path going forward for tokenisation research.
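
To make the two variants concrete, the following Python sketch illustrates the compression objectives on a tiny binary dataset. It is an illustration under stated assumptions, not the paper's formal definitions: the function names (`direct_token_count`, `apply_merges`, `best_direct_vocab`) are hypothetical, the direct variant is assumed to allow single characters plus the chosen vocabulary, and merges are assumed to be applied in order, left to right. The exhaustive search at the end is exponential in the vocabulary budget and only serves to show what is being optimised.

```python
from itertools import combinations


def direct_token_count(text, vocab):
    """Fewest tokens needed to segment `text` using single characters plus `vocab`.

    Assumption: every single character is always an allowed token.
    """
    n = len(text)
    best = [0] + [n + 1] * n  # best[i] = min tokens covering text[:i]
    for end in range(1, n + 1):
        best[end] = best[end - 1] + 1  # fall back to a single character
        for tok in vocab:
            start = end - len(tok)
            if start >= 0 and text[start:end] == tok:
                best[end] = min(best[end], best[start] + 1)
    return best[n]


def apply_merges(text, merges):
    """Apply a merge sequence in order (assumed BPE-style, left to right)."""
    seq = list(text)
    for a, b in merges:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)  # replace the adjacent pair by one token
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq


def best_direct_vocab(dataset, k):
    """Exhaustively search all k-subsets of substrings of length >= 2.

    Returns (total token count, vocabulary). Exponential in k; tiny inputs only.
    """
    candidates = sorted(
        {s[i:j] for s in dataset for i in range(len(s)) for j in range(i + 2, len(s) + 1)}
    )
    best = None
    for vocab in combinations(candidates, k):
        cost = sum(direct_token_count(s, vocab) for s in dataset)
        if best is None or cost < best[0]:
            best = (cost, vocab)
    return best


if __name__ == "__main__":
    dataset = ["0101", "0110"]
    print(best_direct_vocab(dataset, 1))
    print(apply_merges("0101", [("0", "1"), ("01", "01")]))
```

Checking a given vocabulary or merge sequence is cheap here (a dynamic programme or a linear pass per merge); the cost lies in the search over candidate vocabularies, which is the part the hardness results concern.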

Paper Structure

This paper contains 33 sections, 17 theorems, 105 equations, 1 table.

Key Result

Theorem 1

The binary direct tokenisation decision problem is $\mathsf{NP}$-complete.

Theorems & Definitions (60)

  • Definition 1
  • proof
  • Definition 2
  • Theorem 1
  • proof : Proof sketch
  • Theorem 2
  • proof : Proof sketch
  • Theorem 3
  • proof : Proof sketch
  • Theorem 4
  • ...and 50 more