Tokenisation is NP-Complete
Philip Whittington, Gregor Bachmann, Tiago Pimentel
TL;DR
This paper asks whether optimal tokenisers, under common compression objectives, can be found efficiently. It formalises two tokenisation paradigms, direct tokenisation and bottom-up tokenisation, and proves that the associated decision problems are NP-complete by reducing max-2-SAT to each. Unless P = NP, no polynomial-time algorithm can therefore find an optimal tokeniser under these objectives. The results suggest that practical tokenisation will continue to rely on approximate algorithms (e.g., BPE, UnigramLM), and that further theoretical work should explore other objective functions and connections to dictionary compression.
Abstract
In this work, we prove the NP-completeness of two variants of tokenisation, defined as the problem of compressing a dataset to at most $\delta$ symbols by either finding a vocabulary directly (direct tokenisation), or selecting a sequence of merge operations (bottom-up tokenisation).
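To make the two decision problems concrete, the following is a minimal brute-force sketch, not the paper's construction. It assumes that the vocabulary always contains every single character of the dataset, that at most K extra tokens (or K merges) may be chosen, and that a merge replaces every left-to-right occurrence of an adjacent token pair. All function names (`min_tokens`, `direct_tokenisation`, `merge_pair`, `apply_merges`, `bottom_up_tokenisation`) are hypothetical and chosen for illustration. Both searches take exponential time, which is consistent with the hardness result rather than a way around it.

```python
from itertools import combinations


def min_tokens(text, vocab):
    """Fewest tokens needed to segment `text` with `vocab` (dynamic programming)."""
    inf = float("inf")
    best = [0] + [inf] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            if text[start:end] in vocab and best[start] + 1 < best[end]:
                best[end] = best[start] + 1
    return best[-1]


def direct_tokenisation(dataset, k, delta):
    """Is there a set of at most k extra tokens that compresses
    the dataset to at most delta symbols? (Exhaustive search.)"""
    alphabet = {ch for s in dataset for ch in s}
    # Only substrings of the dataset can ever be used as tokens.
    candidates = sorted({s[i:j] for s in dataset
                         for i in range(len(s))
                         for j in range(i + 2, len(s) + 1)})
    for extra in combinations(candidates, min(k, len(candidates))):
        vocab = alphabet | set(extra)
        if sum(min_tokens(s, vocab) for s in dataset) <= delta:
            return True
    return False


def merge_pair(tokens, pair):
    """Replace each left-to-right occurrence of the adjacent pair by one token."""
    left, right = pair
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def apply_merges(text, merges):
    """Tokenise `text` by applying the merge sequence in order."""
    tokens = list(text)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


def bottom_up_tokenisation(dataset, k, delta):
    """Is there a sequence of at most k merges that compresses
    the dataset to at most delta symbols? (Exhaustive search.)"""
    def search(token_seqs, remaining):
        if sum(len(t) for t in token_seqs) <= delta:
            return True
        if remaining == 0:
            return False
        # Only pairs that are currently adjacent can shorten anything.
        pairs = sorted({(t[i], t[i + 1])
                        for t in token_seqs for i in range(len(t) - 1)})
        return any(search([merge_pair(t, p) for t in token_seqs], remaining - 1)
                   for p in pairs)
    return search([list(s) for s in dataset], k)


dataset = ["abab", "abc"]                       # 7 characters in total
print(direct_tokenisation(dataset, 2, 2))       # True: add "abab" and "abc"
print(bottom_up_tokenisation(dataset, 2, 2))    # False: two merges reach 3 at best
print(bottom_up_tokenisation(dataset, 2, 3))    # True: e.g. (a, b) then (ab, ab)
```

On this toy dataset the two variants already differ: a directly chosen vocabulary of two extra tokens reaches 2 symbols, while any two merges leave at least 3, because a merge can only combine tokens that earlier merges have already produced.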
