Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles
Buu Phan, Brandon Amos, Itai Gat, Marton Havasi, Matthew Muckley, Karen Ullrich
TL;DR
The paper analyzes how tokenization affects language-model predictions by formalizing byte-level and tokenized data-generating processes and showing a statistical equivalence between them. It identifies tokenization bias, a discrepancy in next-byte distributions between tokenized and byte-level models, and introduces the Byte-Token Representation Lemma (BTR) together with an exact next-byte sampling method that eliminates bias without retraining. This enables zero-shot conversion to token-free behavior and supports robust model ensembling by mapping heterogeneous vocabularies to a universal byte space, with practical gains in fill-in-the-middle code tasks (up to 18% improvement) and ensemble performance (up to 3.7%). While the method imposes memory overhead and introduces some inference-time costs, it provides a principled, training-free way to obtain bias-free byte-level predictions from any tokenized LM, broadening applicability to long-context tasks and multi-model systems.
Abstract
Tokenization is associated with many poorly understood shortcomings in language models (LMs), yet it remains an important component for scaling to long sequences. This work studies how tokenization impacts model performance by analyzing and comparing the stochastic behavior of tokenized models with that of their byte-level, or token-free, counterparts. We discover that, even when the two models are statistically equivalent, their predictive distributions over the next byte can be substantially different, a phenomenon we term "tokenization bias". To fully characterize this phenomenon, we introduce the Byte-Token Representation Lemma, a framework that establishes a mapping between the learned token distribution and its equivalent byte-level distribution. From this result, we develop a next-byte sampling algorithm that eliminates tokenization bias without requiring further training or optimization. In other words, it enables zero-shot conversion of tokenized LMs into statistically equivalent token-free ones. We demonstrate its broad applicability with two use cases: fill-in-the-middle (FIM) tasks and model ensembles. In FIM tasks, where input prompts may terminate mid-token and lead to out-of-distribution tokenization, our method mitigates performance degradation and achieves an 18% improvement on FIM coding benchmarks, while consistently outperforming the standard token healing fix. For model ensembles in which each model employs a distinct vocabulary, our approach enables seamless integration, improving performance by up to 3.7% over individual models across various standard baselines in reasoning, knowledge, and coding. Code is available at: https://github.com/facebookresearch/Exact-Byte-Level-Probabilities-from-Tokenized-LMs
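To make the notion of tokenization bias concrete, the following is a minimal toy sketch, not the paper's algorithm or code. It assumes a hypothetical three-entry vocabulary ("AA", "A", "B"), a greedy longest-match tokenizer, and an i.i.d. source over the single-byte characters A and B; the names `VOCAB`, `greedy_tokenize`, and `next_symbol_stats` are illustrative only. Under these assumptions the token "A" can never be followed by a token starting with "A" (the tokenizer would have emitted "AA" instead), so next-symbol statistics conditioned on the token "A" differ sharply from those conditioned on the byte "A", even though both describe the same text.

```python
import random
from collections import Counter

# Hypothetical toy vocabulary, ordered so that longer entries are tried first.
VOCAB = ["AA", "A", "B"]


def greedy_tokenize(text):
    """Tokenize left to right, always taking the longest matching vocabulary entry."""
    tokens, i = [], 0
    while i < len(text):
        for tok in VOCAB:
            if text.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {i}")
    return tokens


def next_symbol_stats(n_chars=20000, p_a=0.6, seed=0):
    """Return (byte-level, token-level) estimates of P(next starts with 'A' | current is 'A')."""
    rng = random.Random(seed)
    text = "".join("A" if rng.random() < p_a else "B" for _ in range(n_chars))

    # Byte level: how often is the byte 'A' followed by the byte 'A'?
    byte_pairs = Counter(zip(text, text[1:]))
    byte_p = byte_pairs[("A", "A")] / (byte_pairs[("A", "A")] + byte_pairs[("A", "B")])

    # Token level: how often is the token 'A' followed by a token whose first byte is 'A'?
    tokens = greedy_tokenize(text)
    tok_pairs = Counter((prev, nxt[0]) for prev, nxt in zip(tokens, tokens[1:]))
    after_a = tok_pairs[("A", "A")] + tok_pairs[("A", "B")]
    token_p = tok_pairs[("A", "A")] / after_a
    return byte_p, token_p


if __name__ == "__main__":
    byte_p, token_p = next_symbol_stats()
    print(f"byte-level  P(next byte = A | byte A)          ~ {byte_p:.3f}")
    print(f"token-level P(next token starts A | token A)   = {token_p:.3f}")
```

In this toy setting the byte-level estimate stays close to the source probability of A, while the token-level estimate is exactly zero. A model trained on such token sequences and prompted with a text ending in the token "A" would therefore assign no mass to a continuation starting with "A", which is the kind of mismatch that arises when an FIM prompt ends mid-token.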
