Byte BPE Tokenization as an Inverse string Homomorphism

Saibo Geng, Sankalp Gambhir, Chris Wendler, Robert West

TL;DR

This work formalizes tokenization as an inverse homomorphism from the character space to the token-ID space, with detokenization acting as the homomorphism back to the character space. It shows that, for context-free and regular languages, extended tokenization preserves the original language structure, and that Unicode can be accommodated via byte-level tokenization without breaking this property. The authors construct a PDA-based framework to recognize token languages and introduce the notions of proper versus extended tokenization, highlighting how the latter suffices for neural model expressiveness while addressing practical issues such as BPE and leading-space effects. Overall, the paper provides a rigorous, language-theoretic lens on tokenization, offering insights for tokenizer design and for the interpretability of LLMs with respect to formal-language processing.

Abstract

Tokenization is an important preprocessing step in the training and inference of large language models (LLMs). While there has been extensive research on the expressive power of the neural architectures used in LLMs, the impact of tokenization has not been well understood. In this work, we demonstrate that tokenization, irrespective of the algorithm used, acts as an inverse homomorphism between strings and tokens. This suggests that the character space of the source language and the token space of the tokenized language are homomorphic, preserving the structural properties of the source language. Additionally, we explore the concept of proper tokenization, which refers to an unambiguous tokenization returned from the tokenizer. Our analysis reveals that the expressiveness of neural architectures in recognizing context-free languages is not affected by tokenization.
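To make the homomorphism claim concrete, the following is a minimal sketch, not the authors' code. The vocabulary `VOCAB` and the helper names `detokenize` and `inverse_image` are hypothetical and chosen only for illustration. Detokenization concatenates the string piece of each token ID, so it distributes over concatenation, and the inverse image of a string is the set of all token-ID sequences that detokenize to it.

```python
# Hypothetical toy vocabulary: token ID -> string piece (not from a real tokenizer).
VOCAB = {0: "a", 1: "b", 2: "ab", 3: "ba"}


def detokenize(ids):
    """String homomorphism h: concatenate the piece of each token ID."""
    return "".join(VOCAB[i] for i in ids)


def inverse_image(text):
    """Return every token-ID sequence that detokenizes to `text`."""
    if text == "":
        return [[]]  # only the empty sequence maps to the empty string
    results = []
    for token_id, piece in VOCAB.items():
        if text.startswith(piece):
            # Consume one piece, then enumerate tokenizations of the remainder.
            for rest in inverse_image(text[len(piece):]):
                results.append([token_id] + rest)
    return results


# h(xy) = h(x) h(y): detokenizing a concatenation equals concatenating the results.
left, right = [0, 1], [3]
print(detokenize(left + right) == detokenize(left) + detokenize(right))

# Several token sequences map to the same string.
print(inverse_image("aba"))
```

In this toy setting the inverse image of `"aba"` contains more than one sequence, whereas a tokenizer returns a single one of them; this is the distinction the summary draws between extended and proper tokenization.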

Paper Structure

This paper contains 19 sections, 11 theorems, 5 equations, 3 figures, 1 table.

Key Result

Theorem 2.1

For every context-free grammar $G$, there exists a pushdown automaton $M$ that accepts the language $L(G)$.

Figures (3)

  • Figure 1: Tokenization and detokenization example illustrating the homomorphism property with OpenAI GPT-2's tokenizer.
  • Figure 2: Tokenization Output for Nested Brackets Using LLaMA-2 Tokenizer
  • Figure 3: Construction of a PDA $M^\prime$ to accept language $h^{-1}(L)$. In the context of LLM, the input $a$ is a token ID, the homomorphism $h$ is detokenization, the buffer is used to store the token $h(a)$, the PDA state is the current state of the PDA in the character space, and the PDA stack is the stack of the PDA in the character space.
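The buffer construction described for Figure 3 can be sketched in code as follows. This is an illustrative approximation under stated assumptions, not the paper's construction verbatim: the vocabulary `VOCAB`, the class `BracketPDA`, and the function `accepts_tokens` are hypothetical, and the character-level language is fixed to balanced square brackets. Each token ID $a$ is expanded to $h(a)$ in a buffer, and the buffer is then drained one character at a time through the character-level PDA, whose state and stack are kept unchanged.

```python
# Hypothetical toy vocabulary: token ID -> bracket piece (not from a real tokenizer).
VOCAB = {0: "[", 1: "]", 2: "[[", 3: "]]", 4: "[]"}


class BracketPDA:
    """Character-level PDA for balanced square brackets."""

    def __init__(self):
        self.stack = []    # the PDA stack, in the character space
        self.dead = False  # True once the input can no longer be accepted

    def step(self, ch):
        if self.dead:
            return
        if ch == "[":
            self.stack.append(ch)
        elif ch == "]" and self.stack:
            self.stack.pop()
        else:
            self.dead = True  # unmatched "]" or an unknown symbol

    def accepts(self):
        return not self.dead and not self.stack


def accepts_tokens(ids):
    """Token-level recognizer: simulate the character PDA through a buffer."""
    pda = BracketPDA()
    for a in ids:
        buffer = VOCAB[a]    # store h(a) in the buffer
        for ch in buffer:    # drain the buffer through the character-level PDA
            pda.step(ch)
    return pda.accepts()


print(accepts_tokens([2, 3]))     # token sequence for "[[]]"
print(accepts_tokens([0, 4, 1]))  # a different token sequence for "[[]]"
print(accepts_tokens([3, 2]))     # token sequence for "]][["
```

Because the recognizer only ever sees the detokenized characters, any two token sequences with the same detokenization are treated identically, which is why it accepts exactly the inverse image of the character-level language.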

Theorems & Definitions (21)

  • Definition 2.1: Context-free Grammar
  • Definition 2.2: Formal Language
  • Definition 2.3: Pushdown Automaton
  • Theorem 2.1: Pushdown Automaton and Context-free Grammar
  • Definition 2.4: String Homomorphism
  • Definition 2.5: Inverse Homomorphism
  • Theorem 2.2: Closure under Inverse Homomorphism
  • Theorem 2.3: Closure under intersection
  • Definition 2.6: Tokenization
  • Definition 2.7: Detokenization
  • ...and 11 more