Tokenization Multiplicity Leads to Arbitrary Price Variation in LLM-as-a-service

Ivi Chatzi, Nina Corvelo Benz, Stratis Tsirtsis, Manuel Gomez-Rodriguez

TL;DR

This paper reveals that pay-per-token pricing for LLM-as-a-service can incur arbitrary cost differences even for identical outputs due to tokenization multiplicity, especially in non-English languages. It introduces canonical generation, a constrained generation framework that forces outputs to the training-time canonical tokenization, and a Gumbel-Max-based sampling algorithm to realize it efficiently. The authors prove that subsequences of canonical token sequences are themselves canonical for common tokenizers and show that canonical generation yields token-sequence distributions closer to the training distribution in KL terms, while maintaining similar performance and speed to standard generation. Empirically, tokenization multiplicity is shown to occur across languages and tasks; canonical generation fixes the pricing issue with minimal practical penalties, making it a promising approach for fairer and more predictable LLM pricing and usage. The work also provides insights into tokenizer properties and constrained generation, with open-source code and data available for replication.
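To make the sampling idea concrete, the following is a minimal sketch of the generic Gumbel-Max trick applied to a masked next-token distribution. It is not the authors' algorithm: the function name `gumbel_max_sample` and the `allowed` set (a stand-in for "tokens that keep the partial output canonical") are assumptions introduced here for illustration only.

```python
import math
import random


def gumbel_max_sample(logits, allowed, rng):
    """Sample an index from softmax(logits) restricted to `allowed`.

    Gumbel-Max trick: add independent Gumbel(0, 1) noise to each allowed
    logit and return the argmax. `allowed` is a hypothetical mask standing
    in for the tokens that would keep the output canonical.
    """
    best_index, best_score = None, -math.inf
    for i, logit in enumerate(logits):
        if i not in allowed:
            continue  # masked-out tokens can never be selected
        u = rng.random()
        while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
            u = rng.random()
        gumbel_noise = -math.log(-math.log(u))
        score = logit + gumbel_noise
        if score > best_score:
            best_index, best_score = i, score
    return best_index


# Toy usage: three tokens, token 1 is masked out.
rng = random.Random(0)
logits = [0.0, 5.0, math.log(3.0)]
allowed = {0, 2}
samples = [gumbel_max_sample(logits, allowed, rng) for _ in range(20000)]
freq_token_2 = samples.count(2) / len(samples)
```

In this toy setup the masked token is never drawn, and token 2 should be drawn roughly three times as often as token 0, matching the renormalized softmax over the allowed set.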

Abstract

Providers of LLM-as-a-service have predominantly adopted a simple pricing model: users pay a fixed price per token. Consequently, one may think that the price two different users would pay for the same output string under the same input prompt is the same. In our work, we show that, surprisingly, this is not (always) true. We find empirical evidence that, particularly for non-English outputs, both proprietary and open-weights LLMs often generate the same (output) string with multiple different tokenizations, even under the same input prompt, and this in turn leads to arbitrary price variation. To address the problem of tokenization multiplicity, we introduce canonical generation, a type of constrained generation that restricts LLMs to only generate canonical tokenizations -- the unique tokenization in which each string is tokenized during the training process of an LLM. Further, we introduce an efficient sampling algorithm for canonical generation based on the Gumbel-Max trick. Experiments on a variety of natural language tasks demonstrate that our sampling algorithm for canonical generation is comparable to standard sampling in terms of performance and runtime, and it solves the problem of tokenization multiplicity.
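As a rough illustration of why one string can carry several prices, the sketch below enumerates every tokenization of a short string under a toy vocabulary and prices each by its token count. The vocabulary, the string, and the per-token price are all hypothetical and chosen only to keep the example small; real tokenizers and provider prices differ.

```python
from functools import lru_cache

# Hypothetical toy vocabulary; real vocabularies are far larger.
VOCAB = frozenset({"H", "a", "u", "s", "Ha", "us", "Haus"})
PRICE_PER_TOKEN = 0.01  # hypothetical price in arbitrary currency units


def all_tokenizations(text, vocab=VOCAB):
    """Return every way to split `text` into tokens drawn from `vocab`."""

    @lru_cache(maxsize=None)
    def split(start):
        if start == len(text):
            return [()]  # one way to tokenize the empty suffix
        results = []
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in vocab:
                results.extend((piece,) + rest for rest in split(end))
        return results

    return split(0)


tokenizations = all_tokenizations("Haus")
lengths = sorted({len(t) for t in tokenizations})
prices = [n * PRICE_PER_TOKEN for n in lengths]

for tokens in tokenizations:
    print(tokens, "->", len(tokens), "tokens")
```

Every tuple printed decodes to the same string, yet the token counts, and hence the pay-per-token prices, differ. Canonical generation, as described above, would restrict the model to the single tokenization that the tokenizer itself produces for that string.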

Paper Structure

This paper contains 24 sections, 7 theorems, 11 equations, 18 figures, 10 tables, 1 algorithm.

Key Result

Theorem 2

BPE-, Unigram-, and WordPiece-based tokenizers are non-recovering.

Figures (18)

  • Figure 1: Translation task example. The top box shows the input prompt, which consists of a translation instruction and the accompanying Wikipedia text to be translated. The bottom two boxes show two outputs generated by Qwen2.5-7B-Instruct as a response to the input prompt, corresponding to the same string but with two different tokenizations.
  • Figure 2: Probability of tokenization multiplicity. The plots show the empirical probability that the lengths of two output tokenizations for the same input prompt differ, conditioned on the output strings being the same. Each panel corresponds to one of the three tasks we consider in our experiments involving outputs in the German language. Across all panels, error bars represent $95\%$ confidence intervals resulting from $100$ input prompts.
  • Figure 3: Magnitude of price variation. The plots show the empirical distribution of the relative difference in length between the longest and shortest tokenization of each output string, across all outputs where tokenization multiplicity occurs. Each panel corresponds to one of the three tasks we consider in our experiments involving outputs in the German language. Across all panels, box plots show the quartiles of the respective distributions with black horizontal lines representing median values.
  • Figure 4: Tokenization multiplicity across languages. The plots show the number of input prompts for which we observe at least two outputs given by $\texttt{gpt5m}$ with the same string but different tokenization lengths. Each panel corresponds to one of the three tasks we consider in our experiments and pairs of letters on the x-axis correspond to different languages. Refer to Table \ref{tab:languages} in Appendix \ref{app:exp-details} for details regarding the languages we use and to Appendix \ref{app:results-multiplicity} for qualitatively similar results using other models.
  • Figure 5: Examples of tokenization multiplicity in (a) translation, (b) spell checking, and (c) rephrasing. In each example, the top box shows the input prompt, which consists of an instruction for the task and the accompanying Wikipedia text to be processed. The bottom two boxes show two outputs generated by (a) Qwen2.5-7B-Instruct, (b) Llama3.1-8B-Instruct, and (c) GPT-4o-mini as a response to the input prompt, corresponding to the same string but with two different tokenizations.
  • ...and 13 more figures

Theorems & Definitions (12)

  • Definition 1
  • Theorem 2
  • Theorem 3
  • Definition 4
  • Definition 5
  • Lemma 1
  • Lemma 2
  • Theorem 6
  • Theorem 7
  • Definition 8
  • ...and 2 more