Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles

Buu Phan, Brandon Amos, Itai Gat, Marton Havasi, Matthew Muckley, Karen Ullrich

TL;DR

The paper analyzes how tokenization affects language-model predictions by formalizing byte-level and tokenized data-generating processes and showing a statistical equivalence between them. It identifies tokenization bias, a discrepancy in next-byte distributions between tokenized and byte-level models, and introduces the Byte-Token Representation Lemma (BTR) together with an exact next-byte sampling method that eliminates bias without retraining. This enables zero-shot conversion to token-free behavior and supports robust model ensembling by mapping heterogeneous vocabularies to a universal byte space, with practical gains in fill-in-the-middle code tasks (up to 18% improvement) and ensemble performance (up to 3.7%). While the method imposes memory overhead and introduces some inference-time costs, it provides a principled, training-free way to obtain bias-free byte-level predictions from any tokenized LM, broadening applicability to long-context tasks and multi-model systems.

Abstract

Tokenization is associated with many poorly understood shortcomings in language models (LMs), yet remains an important component for long-sequence scaling. This work studies how tokenization impacts model performance by analyzing and comparing the stochastic behavior of tokenized models with their byte-level, or token-free, counterparts. We discover that, even when the two models are statistically equivalent, their predictive distributions over the next byte can be substantially different, a phenomenon we term "tokenization bias". To fully characterize this phenomenon, we introduce the Byte-Token Representation Lemma, a framework that establishes a mapping between the learned token distribution and its equivalent byte-level distribution. From this result, we develop a next-byte sampling algorithm that eliminates tokenization bias without requiring further training or optimization. In other words, this enables zero-shot conversion of tokenized LMs into statistically equivalent token-free ones. We demonstrate its broad applicability with two use cases: fill-in-the-middle (FIM) tasks and model ensembles. In FIM tasks, where input prompts may terminate mid-token and lead to out-of-distribution tokenization, our method mitigates performance degradation and achieves an 18% improvement on FIM coding benchmarks, while consistently outperforming the standard token healing fix. For model ensembles where each model employs a distinct vocabulary, our approach enables seamless integration, improving performance by up to 3.7% over individual models across various standard baselines in reasoning, knowledge, and coding. Code is available at: https://github.com/facebookresearch/Exact-Byte-Level-Probabilities-from-Tokenized-LMs
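To make the ensemble use case concrete, the following minimal sketch averages next-byte distributions from two models once both have been mapped to byte space. It assumes each model's next-byte distribution is already available as a dictionary from byte value to probability; the function name `average_next_byte` and the toy distributions are hypothetical, and uniform averaging is only one possible ensemble strategy, not necessarily the one used in the paper's experiments.

```python
def average_next_byte(distributions):
    """Uniformly average next-byte distributions.

    Each distribution is a dict {byte_value: probability} over the shared
    byte alphabet (0..255). Bytes missing from a dict count as probability 0.
    """
    support = set()
    for dist in distributions:
        support.update(dist)
    n = len(distributions)
    return {b: sum(d.get(b, 0.0) for d in distributions) / n
            for b in sorted(support)}


# Hypothetical next-byte distributions from two models with different
# vocabularies, already converted to byte space.
model_a = {ord("a"): 0.6, ord("b"): 0.4}
model_b = {ord("a"): 0.2, ord("c"): 0.8}

ensemble = average_next_byte([model_a, model_b])
```

Because both inputs live on the same byte alphabet, no vocabulary alignment is needed, and the average of normalized distributions is itself normalized.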

Paper Structure

This paper contains 32 sections, 4 theorems, 16 equations, 12 figures, 3 tables, 3 algorithms.

Key Result

Proposition 1

$\mathcal{X}(t^k_1) = \varnothing$ if and only if $t^k_1$ is invalid. As a result, $P(t^k_1)=0.0$. Furthermore, $P(t_k|t^{k-1}_1) = 0.0$ if $t^{k-1}_1$ is valid, otherwise, it is undefined. Proof. See Appendix \ref{proof_invalid_token}.
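As an illustration of what an invalid encoding looks like, here is a minimal sketch. It assumes that a token sequence is valid exactly when re-encoding its decoded string reproduces the same tokens, and it uses a toy vocabulary with a greedy longest-match encoder; the vocabulary and the helper names `encode` and `is_valid` are hypothetical and not taken from the paper's code.

```python
VOCAB = ["AA", "A", "B"]  # hypothetical toy vocabulary


def encode(text, vocab=VOCAB):
    """Greedy longest-match encoding, scanning left to right."""
    by_length = sorted(vocab, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        for tok in by_length:
            if text.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            raise ValueError(f"cannot encode text at position {i}")
    return tokens


def is_valid(tokens):
    """A sequence is valid if the encoder could have produced it."""
    return encode("".join(tokens)) == list(tokens)


valid_example = is_valid(["AA", "B"])    # "AAB" encodes to exactly this
invalid_example = is_valid(["A", "A"])   # "AA" encodes to ["AA"] instead
```

Under this encoder no string ever produces the sequence `["A", "A"]`, so a model trained on correctly tokenized data should assign it zero probability, which is the situation the proposition describes.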

Figures (12)

  • Figure 1: Left: Tokenized LMs can experience tokenization bias when prompts end mid-token, as in this code completion example. This means that the correct solution, $\color{red}{\texttt{a}}$, has zero probability of being chosen. Our method avoids this problem and can predict the correct token while using the same model. Right: Our method maps next-token predictions of arbitrary tokenized LMs to statistically equivalent next-byte predictions (see Section \ref{btr_subsection} for details). This enables any model ensemble strategy, such as averaging or mixture of experts.
  • Figure 1: We evaluate the token and equivalent byte-level model performance of various open source LMs. We further show the byte-ensemble performance of the top-2 performing models. For all benchmarks but GSM8K, byte-level ensembles outperform single models and voting.
  • Figure 2: Tokenization bias on a $1^{\mathrm{st}}$-order Markov chain. Given the context token $"\texttt{A}"$, the tokenized model never samples $"\texttt{A}"$ as the next token, whereas the original chain would emit it with probability $1-\alpha$. In practice, this bias occurs when prompts end with tokens that are part of another token, a common issue in FIM tasks that leads to incorrect completions by the model.
  • Figure 2: Token and byte-level performance of open-weight code LMs. The byte-ensemble of the top-2 models achieves the best results on the HumanEval benchmark while matching Yi-Coder-1.5B on MBPP.
  • Figure 3: Tokenization bias on a $3^{\mathrm{rd}}$-order Markov chain. Our byte-token conversion (bias correction) method in Section \ref{method_section} accurately recovers $P(x_{n+1}|x^n_1)$ of the original chain.
  • ...and 7 more figures
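The first-order Markov chain example from the Figure 2 caption can be reproduced empirically with a small simulation. The sketch below is illustrative only: the vocabulary {"AA", "A", "B"}, the greedy longest-match tokenizer, the value of `ALPHA`, and the uniform transition out of state "B" are all assumptions chosen for the demo rather than settings confirmed by the paper. It measures the tokenization bias only; it does not implement the paper's correction method.

```python
import random
import re

ALPHA = 0.3  # assumed P(next byte = "B" | current byte = "A")


def sample_chain(n, alpha, rng):
    """Sample n symbols from a first-order chain over {"A", "B"}."""
    out = ["A"]
    for _ in range(n - 1):
        if out[-1] == "A":
            out.append("B" if rng.random() < alpha else "A")
        else:
            # Assumed: uniform transition out of state "B".
            out.append("A" if rng.random() < 0.5 else "B")
    return "".join(out)


def greedy_encode(text):
    """Longest-match tokenization: the alternation tries "AA" first."""
    return re.findall(r"AA|A|B", text)


rng = random.Random(0)
text = sample_chain(20000, ALPHA, rng)
tokens = greedy_encode(text)

# Byte level: how often does "A" follow the byte "A"?
after_a = [b for a, b in zip(text, text[1:]) if a == "A"]
byte_freq = sum(b == "A" for b in after_a) / len(after_a)

# Token level: how often does a token starting with "A" follow the token "A"?
followers = [nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == "A"]
token_freq = sum(t.startswith("A") for t in followers) / len(followers)
```

At the byte level, "A" follows "A" with frequency close to `1 - ALPHA`, while at the token level the token "A" is never followed by a token starting with "A", because the greedy encoder would have merged the two into "AA".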

Theorems & Definitions (15)

  • Definition 1
  • Definition 2
  • Definition 3
  • Proposition 1
  • Remark 1
  • Definition 4
  • Lemma 1
  • Remark 2
  • Corollary 1
  • Remark 3
  • ...and 5 more