Understanding and Mitigating Tokenization Bias in Language Models

Buu Phan, Marton Havasi, Matthew Muckley, Karen Ullrich

TL;DR

The paper addresses tokenization-induced bias in autoregressive language models by showing that encoding schemes such as maximum prefix encoding (MPE) and byte-pair encoding (BPE) bias next-token and character-level predictions, and that these biases persist no matter how much training data is used. It introduces bias-correction methods, Maximum Prefix Correction (MPC) for MPE and a Byte-Pair Correction algorithm for BPE, to compute unbiased character-level probabilities $P(x^{N}_{n+1}|x^{n}_{1})$ from tokenized models without finetuning; MPC requires a number of model runs that scales linearly with the sequence length. The approach is validated on a Markov-chain setup, where baseline token-conditioned probabilities $P(x_{n+1}|t^{i}_{1})$ fail to recover the true dynamics, while the proposed methods accurately recover $P(x^{N}_{n+1}|x^{n}_{1})$ and can simulate token-free behavior. This work provides a theoretical and practical framework for unbiased evaluation and cross-vocabulary inference in tokenized LMs, potentially enabling seamless transfer between tokenized and token-free representations without retraining.

Abstract

State-of-the-art language models are autoregressive and operate on subword units known as tokens. Specifically, one must encode the conditioning string into a list of tokens before passing it to the language model for next-token prediction. We show that popular encoding schemes, such as maximum prefix encoding (MPE) and byte-pair encoding (BPE), induce a sampling bias that cannot be mitigated with more training or data. To counter this universal problem, for each encoding scheme above, we propose a novel algorithm to obtain unbiased estimates from any language model trained on tokenized data. Our methods do not require finetuning the model, and their complexity, defined as the number of model runs, scales linearly with the sequence length in the case of MPE. As a result, we show that one can simulate token-free behavior from a tokenized language model. We empirically verify the correctness of our method through a Markov-chain setup, where it accurately recovers the transition probabilities, as opposed to the conventional method of directly prompting tokens into the language model.
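
To make the bias concrete, here is a minimal sketch (ours, not from the paper): characters are drawn i.i.d. and uniformly from {A, B}, so the true next-character distribution is always 50/50, yet under a greedy longest-prefix (MPE-style) tokenizer whose vocabulary contains the merged token "AA", the singleton token "A" is only ever emitted when it could not be merged into "AA", so in the tokenized data it is always followed by "B". The toy vocabulary and encoder below are illustrative assumptions.

```python
import random

# Minimal sketch (illustrative, not from the paper): characters are i.i.d.
# fair coin flips over {"A", "B"}, so the true next-character distribution is
# always 50/50. We tokenize with a greedy longest-prefix (MPE-style) encoder
# whose vocabulary contains the merged token "AA".
VOCAB = ["AA", "A", "B"]

def mpe_encode(s, vocab=VOCAB):
    """Greedy longest-prefix-match (MPE-style) encoding."""
    tokens, i = [], 0
    while i < len(s):
        tok = max((v for v in vocab if s.startswith(v, i)), key=len)
        tokens.append(tok)
        i += len(tok)
    return tokens

random.seed(0)
counts = {"A": 0, "B": 0}
for _ in range(20_000):
    s = "".join(random.choice("AB") for _ in range(12))
    toks = mpe_encode(s)
    # Whenever the context ends with the singleton token "A", record the first
    # character of the token that follows it in the tokenized data.
    for prev, nxt in zip(toks, toks[1:]):
        if prev == "A":
            counts[nxt[0]] += 1

# The singleton "A" is only emitted when it could NOT be merged into "AA",
# so the next character is always "B": a model trained on these tokens learns
# P(next = "B" | token "A") = 1 instead of the true 0.5.
print(counts)
```

No amount of additional data changes this split: it is a property of the encoding itself, which is exactly the kind of bias the paper's correction methods remove.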

Paper Structure

This paper contains 28 sections, 10 theorems, 23 equations, 7 figures, and 3 algorithms.

Key Result

Proposition 2.3

(Token-Induced Zero Probability) Let $t^i_1$ be a sequence of input tokens. For any invalid encoding $t^i_1$, we have $P_{\mathrm{gt}}(t^i_1){=}0.0$ and the conditional probability $P_{\mathrm{gt}}(t_{i+1}|t^i_1)$ is undefined. In the case $t^i_1$ is valid, then $P_{\mathrm{gt}}(t_{i+1}|t^i_1){=}0.0$ for any next token $t_{i+1}$ that makes the extended sequence $t^{i+1}_1$ an invalid encoding.
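
As a concrete illustration (ours, not the paper's worked example): with the vocabulary {"A", "B", "AB"} and MPE, the token pair ("A", "B") is an invalid encoding of the string "AB", because the encoder itself would emit the single token "AB"; hence $P_{\mathrm{gt}}((\text{"A"},\text{"B"})){=}0$, and conditioned on the valid context ("A"), the next token "B" has probability zero. For MPE, a simple way to test validity is to re-encode the decoded string and compare, as sketched below with an assumed toy vocabulary (BPE has its own validity condition, per the Definition referenced in the Figure 5 caption).

```python
def mpe_encode(s, vocab=("AB", "A", "B")):
    """Greedy longest-prefix-match (MPE) encoding over a toy vocabulary."""
    tokens, i = [], 0
    while i < len(s):
        tok = max((v for v in vocab if s.startswith(v, i)), key=len)
        tokens.append(tok)
        i += len(tok)
    return tokens

def is_valid_encoding(tokens, vocab=("AB", "A", "B")):
    # Under MPE, an encoding is valid iff the (deterministic) encoder
    # reproduces it from its own decoding.
    return mpe_encode("".join(tokens), vocab) == list(tokens)

print(is_valid_encoding(["A", "B"]))  # False: the encoder would emit ["AB"]
print(is_valid_encoding(["AB"]))      # True
```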

Figures (7)

  • Figure 1: Next-character sampling bias introduced by the WordPiece encoding algorithm. In this example, given the context token "A", the model will always predict the next token "B" with probability $1.0$. We present a technique that, given a language model trained on tokenized data, eliminates this bias and recovers the unbiased sampling distribution.
  • Figure 2: MPC Visualization. At each recursive call, the Branch step finds tokens that start with the query string, while the Pass step extracts the next token and the leftover string and passes them to the next recursive call, until the base case is reached (a schematic sketch of this recursion is given after this figure list).
  • Figure 3: Our method accurately estimates the transition probabilities of a 3rd-order Markov chain, while the baseline method fails to do so.
  • Figure 4: Interpretations of Proposition \ref{tokenbasic}, which shows that for any string $s$ with prefix $x^n_1$, its tokens within $x^n_1$ must start at certain designated positions. For each encoding, the same color indicates characters belonging to the same token. This later allows us to construct an efficient algorithm to correct the bias. See Corollary \ref{empty_corol} for details on invalid encodings.
  • Figure 5: (Left) Representation of $P(x^n_1)$ using a tokenized LM, with an example of cover encodings and of valid/invalid encodings. (Right) Illustration of the Byte-Pair Correction algorithm for BPE encoding. A green tick and a red cross denote valid and invalid encodings, respectively, which can be checked using Definition \ref{invalid_enc_bpe}. The termination step does not appear in the algorithm (for simplicity) but can be easily implemented.
  • ...and 2 more figures
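
The Figure 2 caption describes MPC's Branch/Pass recursion; below is a schematic Python sketch of our reading of that recursion, not the paper's algorithm listing. It assumes a hypothetical helper `next_token_probs(context_tokens)` that returns the model's next-token distribution as a dict, and it omits the MPE validity constraints and caching that the paper relies on for an exact, efficient estimate.

```python
def prefix_prob(query, context_tokens, next_token_probs):
    """Schematic Branch/Pass recursion: probability, under the tokenized LM,
    that the generated text continues with the character string `query`.

    `next_token_probs(context_tokens)` is an assumed helper returning a dict
    {token_string: probability}; it is not part of the paper's code.
    """
    if not query:                    # base case: the whole query has been matched
        return 1.0
    total = 0.0
    for tok, p in next_token_probs(context_tokens).items():
        if tok.startswith(query):    # Branch: this token covers the remaining query
            total += p
        elif query.startswith(tok):  # Pass: consume the token, recurse on the leftover string
            total += p * prefix_prob(query[len(tok):],
                                     context_tokens + [tok],
                                     next_token_probs)
    return total
```

Ratios of such prefix probabilities give character-level conditionals such as $P(x^{N}_{n+1}|x^{n}_{1})$, and, per the Figure 4 caption, the fact that tokens covering a prefix can only start at certain designated positions is what keeps the number of model calls linear in the sequence length for MPE.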

Theorems & Definitions (32)

  • Definition 2.1
  • Definition 2.2
  • Proposition 2.3
  • proof
  • Remark 1
  • Proposition 3.1
  • proof
  • Corollary 3.2
  • proof
  • Proposition 2.1
  • ...and 22 more