Tokenization as Finite-State Transduction

Marco Cognetta, Naoaki Okazaki

TL;DR

We introduce, from first principles, a finite-state transduction framework that can encode all possible tokenizations of a regular language. We also show constructively that Byte-Pair Encoding and MaxMatch, two popular tokenization schemes, can be represented efficiently by simple finite-state transducers.

Abstract

Tokenization is the first step in modern neural language model pipelines where an input text is converted to a sequence of subword tokens. We introduce from first principles a finite-state transduction framework which can efficiently encode all possible tokenizations of a regular language. We then constructively show that Byte-Pair Encoding (BPE) and MaxMatch (WordPiece), two popular tokenization schemes, fit within this framework. For BPE, this is particularly surprising given its resemblance to context-free grammar and the fact that it does not tokenize strings from left to right. An application of this is to guided generation, where the outputs of a language model are constrained to match some pattern. Here, patterns are encoded at the character level, which creates a mismatch between the constraints and the model's subword vocabulary. While past work has focused only on constraining outputs without regard to the underlying tokenization algorithm, our framework allows for simultaneously constraining the model outputs to match a specified pattern while also adhering to the underlying tokenizer's canonical tokenization.

Paper Structure

This paper contains 26 sections, 11 theorems, 18 equations, 7 figures, 8 algorithms.

Key Result

Lemma 4.0

Let $\mathcal{A}$ be a minimal, deterministic automaton over $\Sigma$ and $\mathcal{T}$ be a character-to-subword transducer over $\Sigma \subseteq \Gamma \subset \Sigma^+$ (Equation eq:lexicon_formulation). Then $\varepsilon\textsc{-Removal}(\textsc{Proj}(\mathcal{A} \circ \mathcal{T}))$ is determini

Figures (7)

  • Figure 1: MaxMatch encoding of bananas with $\Gamma = \{\texttt{a, b, n, s, ba, na, ban, bana}\}$. At each step, the longest matching token is added to the tokenized sequence to produce bana␣ na␣ s.
  • Figure 2: An example of BPE tokenization given $\mu = \langle$(t, o), (g, y), (l, o), (p, o), (lo, gy) $\rangle$. Notice that the merges are not necessarily done left to right or in order of length. The final tokenized sequence is to␣ po␣ logy.
  • Figure 3: An example of projecting a character-level input pattern $\mathcal{A} = \texttt{abaabcc}$ to the subword level, given a subword vocabulary $\{\texttt{a, b, c, ab, abc, bc}\}$ represented by the character-to-subword transducer $\mathcal{T}$. The intermediate transducers formed during this process are shown in Figures (c) and (d), and the final, minimized subword automaton is given in Figure (e). Observe that for every accepting path in $\textsc{Min}(\textsc{Proj}(\mathcal{A} \circ \mathcal{T}))$, the concatenation of the subwords on that path satisfies the pattern in $\mathcal{A}$ when spelled out character-by-character. For example, $\texttt{ab\textvisiblespace a\textvisiblespace a\textvisiblespace bc\textvisiblespace c}$ and $\texttt{a\textvisiblespace b\textvisiblespace a\textvisiblespace abc\textvisiblespace c}$, which are accepted by $\textsc{Min}(\textsc{Proj}(\mathcal{A} \circ \mathcal{T}))$, both correspond to $\texttt{abaabcc}$, which is accepted by $\mathcal{A}$.
  • Figure 4: A character-level automaton $\mathcal{A}$ is intersected with subword lexicons over {a, b, aa, ab} represented by the MaxMatch-preserving transducer $\mathcal{T}_{Aho}$, and the tokenization-agnostic transducer $\mathcal{T}$, shown in Figures (b) and (c), respectively. The results of the composition are shown in (d) and (e). In (e) specifically, arcs and states that appear in the unconstrained automaton but would not appear in the constrained automaton (since they do not encode greedy maximal matches) are shown in dashed-red.
  • Figure 5: A merge gadget $G_{(\texttt{a}, \texttt{b})}$ for the merge $(\texttt{a}, \texttt{b}) \rightarrow \texttt{ab}$. All arcs that don't have an output symbol are assumed to be of the form $(q, c, c, p)$.
  • ...and 2 more figures
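
The examples in Figures 1 to 3 can be reproduced with a short Python sketch. This is not the paper's finite-state construction; it uses plain string procedures that are assumed to match the behaviour the captions describe. The function names (maxmatch, bpe, all_tokenizations) are hypothetical, the BPE variant shown applies each merge in list order across the whole sequence, and the input word topology is inferred from the output given in Figure 2.

```python
def maxmatch(text, vocab):
    """Greedy longest-match tokenization, scanning left to right."""
    max_len = max(len(t) for t in vocab)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return tokens


def bpe(text, merges):
    """Apply each merge rule, in list order, to every adjacent matching pair."""
    tokens = list(text)
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


def all_tokenizations(text, vocab):
    """Enumerate every way to split text into tokens from vocab."""
    if not text:
        return [[]]
    results = []
    for k in range(1, len(text) + 1):
        if text[:k] in vocab:
            for rest in all_tokenizations(text[k:], vocab):
                results.append([text[:k]] + rest)
    return results


# Figure 1: MaxMatch on "bananas".
fig1_vocab = {"a", "b", "n", "s", "ba", "na", "ban", "bana"}
print(maxmatch("bananas", fig1_vocab))   # ['bana', 'na', 's']

# Figure 2: BPE with the merge list mu (input word inferred from the caption).
mu = [("t", "o"), ("g", "y"), ("l", "o"), ("p", "o"), ("lo", "gy")]
print(bpe("topology", mu))               # ['to', 'po', 'logy']

# Figure 3: both tokenizations named in the caption spell out "abaabcc".
fig3_vocab = {"a", "b", "c", "ab", "abc", "bc"}
fig3_all = all_tokenizations("abaabcc", fig3_vocab)
print(["ab", "a", "a", "bc", "c"] in fig3_all)
print(["a", "b", "a", "abc", "c"] in fig3_all)
```

The enumerator is exponential in the worst case, which is the motivation for a compact automaton representation such as the one shown in Figure 3(e).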

Theorems & Definitions (13)

  • Lemma 4.0
  • Theorem 5.1
  • Corollary 5.1.1
  • Lemma C.0
  • Proposition C.1
  • Remark C.2
  • Remark C.3
  • Lemma C.4
  • Lemma C.5
  • Proposition C.6
  • ...and 3 more