Tokenisation is NP-Complete

Philip Whittington, Gregor Bachmann, Tiago Pimentel

TL;DR

This paper addresses the fundamental question of whether optimal tokenisers, under common compression objectives, can be found efficiently. It defines two formal tokenisation paradigms—direct tokenisation and bottom-up tokenisation—and proves that the associated decision problems are NP-complete by reducing max-$2$-SAT to each, thereby showing the likely intractability of exact solutions. The results imply that practical tokenisation should rely on approximation algorithms (e.g., BPE, UnigramLM) and that further theoretical work should explore other objective functions and connections to dictionary compression. Overall, the study clarifies a theoretical barrier to optimal tokeniser design and motivates algorithmic development around scalable, approximate approaches with compression-focused justification.

Abstract

In this work, we prove the NP-completeness of two variants of tokenisation, defined as the problem of compressing a dataset to at most $\delta$ symbols by either finding a vocabulary directly (direct tokenisation), or selecting a sequence of merge operations (bottom-up tokenisation).
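To make the two compression objectives concrete, here is a minimal Python sketch on a toy dataset. It assumes a formalisation that the abstract only implies: in direct tokenisation each string is segmented with the fewest possible symbols drawn from the single characters plus a chosen vocabulary, and in bottom-up tokenisation each merge is applied in order, left to right, over the whole string. The function names `min_symbols` and `apply_merges` and the toy dataset are illustrative and do not come from the paper.

```python
from typing import Iterable, List, Sequence, Tuple


def min_symbols(text: str, vocab: Iterable[str]) -> int:
    """Fewest symbols needed to spell `text` using its single characters
    plus the tokens in `vocab` (assumed reading of direct tokenisation)."""
    tokens = set(vocab) | set(text)  # single characters are always available
    # best[i] = fewest symbols covering text[:i]
    best = [0] + [len(text) + 1] * len(text)
    for end in range(1, len(text) + 1):
        for tok in tokens:
            start = end - len(tok)
            if start >= 0 and text[start:end] == tok:
                best[end] = min(best[end], best[start] + 1)
    return best[len(text)]


def apply_merges(text: str, merges: Sequence[Tuple[str, str]]) -> List[str]:
    """Apply each merge in order, scanning left to right
    (assumed reading of bottom-up tokenisation)."""
    symbols = list(text)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)  # replace the adjacent pair by one symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


dataset = ["abab", "abc"]  # hypothetical toy dataset

# Direct: pick the vocabulary {"ab"} and count symbols after optimal segmentation.
direct_total = sum(min_symbols(s, {"ab"}) for s in dataset)

# Bottom-up: pick the single merge (a, b) and count symbols after applying it.
bottom_up_total = sum(len(apply_merges(s, [("a", "b")])) for s in dataset)

print(direct_total, bottom_up_total)
```

Under these assumptions, the decision question for either variant is whether some choice of vocabulary (or merge sequence) of bounded size brings the total at or below $\delta$. Evaluating one given choice is cheap; the hardness result concerns the search over all choices.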

Paper Structure

This paper contains 22 sections, 14 theorems, 74 equations, 4 tables.

Key Result

Theorem 1

The direct tokenisation decision problem, as given in the paper's formal definition of the tokenisation decision problem, is NP-complete.
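The following Python sketch illustrates what the theorem implies in practice, under an assumed formalisation: given a dataset, a vocabulary budget `k`, and a threshold `delta`, decide whether some set of at most `k` multi-character tokens compresses the dataset to at most `delta` symbols. Checking one candidate vocabulary takes polynomial time (the dynamic programme in `min_symbols`), which is why the problem is in NP; the exhaustive search in `decide_direct` tries exponentially many vocabularies in the worst case. The names `min_symbols`, `candidate_tokens`, `decide_direct`, `k`, and `delta` are illustrative, and this brute-force procedure is not the paper's construction.

```python
from itertools import combinations
from typing import Iterable, List, Optional, Tuple


def min_symbols(text: str, vocab: Iterable[str]) -> int:
    """Polynomial-time check: fewest symbols spelling `text` with its
    single characters plus the tokens in `vocab`."""
    tokens = set(vocab) | set(text)
    best = [0] + [len(text) + 1] * len(text)  # best[i] covers text[:i]
    for end in range(1, len(text) + 1):
        for tok in tokens:
            start = end - len(tok)
            if start >= 0 and text[start:end] == tok:
                best[end] = min(best[end], best[start] + 1)
    return best[len(text)]


def candidate_tokens(dataset: List[str]) -> List[str]:
    """All substrings of length >= 2 in the dataset; tokens that never
    occur cannot help, so only these need to be considered."""
    return sorted({s[i:j] for s in dataset
                   for i in range(len(s))
                   for j in range(i + 2, len(s) + 1)})


def decide_direct(dataset: List[str], k: int, delta: int) -> Optional[Tuple[str, ...]]:
    """Return a witness vocabulary of at most k tokens reaching <= delta
    symbols in total, or None if no such vocabulary exists."""
    cands = candidate_tokens(dataset)
    for size in range(0, min(k, len(cands)) + 1):
        for vocab in combinations(cands, size):  # exponential in the worst case
            if sum(min_symbols(s, vocab) for s in dataset) <= delta:
                return vocab
    return None


dataset = ["abab", "abc"]  # hypothetical toy dataset
print(decide_direct(dataset, k=1, delta=4))
print(decide_direct(dataset, k=1, delta=3))
print(decide_direct(dataset, k=2, delta=3))
```

On inputs this small the search finishes instantly. The number of candidate vocabularies grows combinatorially with the dataset and budget, which is consistent with the summary's point that practical tokenisers rely on approximate methods such as BPE or UnigramLM.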

Theorems & Definitions (41)

  • Definition 1
  • Definition 2
  • Definition 3
  • Theorem 1
  • proof
  • Lemma 1
  • proof
  • Lemma 2
  • proof: Proof sketch
  • Lemma 3
  • ...and 31 more