Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives

Ander Artola Velasco, Stratis Tsirtsis, Nastaran Okati, Manuel Gomez-Rodriguez

TL;DR

This work analyzes pay-per-token pricing in LLM-as-a-service and shows it creates moral hazard due to tokenization ambiguity, enabling overcharging without altering the user-visible output. By formulating a principal-agent model, the authors prove that finding the longest plausible tokenization under common sampling schemes is NP-hard, propose a practical heuristic that can profitably misreport tokenizations, and then derive an incentive-compatible alternative: pay-per-character pricing, which prices by string length and eliminates token-count incentives. They characterize IC pricing formally—ruling out pay-per-token as IC when multi-character tokens exist—and offer a method to preserve average margins when transitioning to pay-per-character pricing, using the token-to-character ratio. Empirical validation on Llama, Gemma, and Ministral models with LMSYS prompts demonstrates the vulnerability under pay-per-token and the viability of pay-per-character pricing, underscoring a practical pathway to stronger consumer protection and fairer pricing in LLM-as-a-service.

Abstract

State-of-the-art large language models require specialized hardware and substantial energy to operate. As a consequence, cloud-based services that provide access to large language models have become very popular. In these services, the price users pay for an output provided by a model depends on the number of tokens the model uses to generate it: they pay a fixed price per token. In this work, we show that this pricing mechanism creates a financial incentive for providers to strategize and misreport the (number of) tokens a model used to generate an output, and users cannot prove, or even know, whether a provider is overcharging them. However, we also show that, if an unfaithful provider is obliged to be transparent about the generative process used by the model, misreporting optimally without raising suspicion is hard. Nevertheless, as a proof-of-concept, we develop an efficient heuristic algorithm that allows providers to significantly overcharge users without raising suspicion. Crucially, we demonstrate that the cost of running the algorithm is lower than the additional revenue from overcharging users, highlighting the vulnerability of users under the current pay-per-token pricing mechanism. Further, we show that, to eliminate the financial incentive to strategize, a pricing mechanism must price tokens linearly on their character count. While this makes a provider's profit margin vary across tokens, we introduce a simple prescription under which the provider who adopts such an incentive-compatible pricing mechanism can maintain the average profit margin they had under the pay-per-token pricing mechanism. Along the way, to illustrate and complement our theoretical results, we conduct experiments with several large language models from the $\texttt{Llama}$, $\texttt{Gemma}$ and $\texttt{Ministral}$ families, and input prompts from the LMSYS Chatbot Arena platform.
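The incentive described in the abstract can be illustrated with a minimal sketch. The two tokenizations, the prices, and the function names below are hypothetical and chosen only for illustration; they are not taken from the paper or from any real tokenizer.

```python
def pay_per_token(tokens, price_per_token):
    """Charge a fixed price for every reported token."""
    return price_per_token * len(tokens)


def pay_per_character(tokens, price_per_char):
    """Charge a fixed price for every character of the output string."""
    return price_per_char * sum(len(t) for t in tokens)


# Two hypothetical tokenizations of the same user-visible string.
faithful = ["Hello", " world"]
misreported = ["Hel", "lo", " wor", "ld"]

# The user sees the same text in both cases.
same_output = "".join(faithful) == "".join(misreported)

# Illustrative integer prices (arbitrary units).
token_charges = (pay_per_token(faithful, 2), pay_per_token(misreported, 2))
char_charges = (pay_per_character(faithful, 1), pay_per_character(misreported, 1))
```

Under pay-per-token the charge grows with the number of reported tokens, so the longer tokenization costs the user more for identical text. Under pay-per-character the charge depends only on the string, so how it is split into tokens does not matter.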

Paper Structure

This paper contains 23 sections, 4 theorems, 16 equations, 12 figures, 1 table, 2 algorithms.

Key Result

Theorem 3

The problem of finding the longest tokenization of a given output that is plausible under top-$p$ sampling, as defined in Eq. \ref{eq:constrained}, is NP-hard.
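To make "plausible under top-$p$ sampling" concrete, here is a minimal sketch of a plausibility check. It assumes that a tokenization is plausible when every token lies in the top-$p$ set of the model's next-token distribution given the preceding tokens. The function names and the toy distribution are hypothetical stand-ins for a real LLM, and the sketch only verifies one candidate tokenization; it does not solve the NP-hard search for the longest one.

```python
def top_p_set(probs, p):
    """Smallest set of highest-probability tokens whose total mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen, cumulative = set(), 0.0
    for token, prob in ranked:
        chosen.add(token)
        cumulative += prob
        if cumulative >= p:
            break
    return chosen


def is_plausible(tokens, next_token_probs, p):
    """Check that each token is in the top-p set given its prefix."""
    for i, token in enumerate(tokens):
        probs = next_token_probs(tuple(tokens[:i]))
        if token not in top_p_set(probs, p):
            return False
    return True


def toy_model(prefix):
    # Hypothetical, context-independent next-token distribution.
    return {"Hello": 0.5, " world": 0.3, "Hel": 0.15, "lo": 0.05}


faithful_ok = is_plausible(["Hello", " world"], toy_model, p=0.9)
split_ok = is_plausible(["Hel", "lo", " world"], toy_model, p=0.9)
```

In this toy setting the split tokenization is rejected at $p=0.9$ because the token "lo" falls outside the top-$p$ set, which is the kind of constraint a transparent provider would have to respect when misreporting.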

Figures (12)

  • Figure 1: Misreporting the tokenization of outputs generated by different LLMs using Algorithm \ref{alg:random}. Panel (a) shows the percentage of tokens overcharged by an unfaithful provider who misreports the tokenization of the outputs to 600 prompts from the LMSYS Chatbot Arena platform using Algorithm \ref{alg:random}, for different values of $m$. Panel (b) shows the fraction of outputs to the same 600 prompts where Algorithm \ref{alg:random} returns a tokenization that is plausible under top-$p$ sampling, for different values of $m$. In both panels, we set the temperature to $1.3$ and use top-$p$ sampling with $p=0.95$. In panel (b), we repeat the experiment $5$ times to obtain $90$% confidence intervals.
  • Figure 2: Additional revenue from misreporting the tokenization of outputs using Algorithm \ref{alg:heuristic}. The panels show the percentage of tokens overcharged by an unfaithful provider who misreports the tokenization of the outputs generated by an LLM to $600$ prompts from the LMSYS Chatbot Arena platform using the reporting policies $\pi^\texttt{H}_m$ implemented by Algorithm \ref{alg:heuristic}, for different values of $m$ and $p$. The dashed vertical lines correspond to the optimal value of $m$. Here, we set the temperature of the model to $1.3$ and repeat each experiment $5$ times to obtain $90$% confidence intervals. Refer to Appendix \ref{app:lmsys} for additional results using alternative temperature values and other LLMs.
  • Figure 3: Financial gain from misreporting the tokenization of outputs using Algorithm \ref{alg:heuristic}. The panels show the utility gain that an unfaithful provider who misreports the tokenization of the outputs generated by an LLM to $600$ prompts from the LMSYS Chatbot Arena platform using Algorithm \ref{alg:heuristic} can achieve, for different values of $p$ and the provider's profit margin $\rho_0$. The dashed vertical lines represent the minimum margin above which misreporting is financially viable according to Eq. \ref{eq:heuristic-profitable}. Here, for each value of $p$, we run Algorithm \ref{alg:heuristic} using the optimal number of iterations $m$ shown in Figure \ref{fig:heuristic-topp}, set the temperature of the model to $1.3$ and repeat each experiment $5$ times to obtain $90$% confidence intervals. Refer to Figure \ref{fig:heuristic-profit-appendix} of Appendix \ref{app:lmsys} for additional results using alternative temperature values and other LLMs.
  • Figure 4: Provider's profit margin under a pay-per-character pricing mechanism. The panels show, across different models and different values $\rho_o$ of the provider's profit margin under pay-per-token, the (empirical) cumulative distribution function of their profit margin $\rho(\mathbf{t})=1 - c_\text{gen}(\mathbf{t})/r(\mathbf{t})$ per output $\mathbf{t}$ after their transition to pay-per-character. Here, we first compute the average ratio of number of tokens to number of characters (tpc) across the responses of each model to $600$ multilingual prompts sampled from the LMSYS Chatbot Arena dataset, with sampling proportional to each language's frequency. Then, we set $r_c = r_o\cdot\texttt{tpc}$ and compute the provider's profit margin $\rho(\mathbf{t})$ for outputs to $600$ different multilingual prompts. In all panels, dashed vertical lines show empirical averages of the respective distributions, and we set the temperature of the models to $1.3$ and use top-$p$ sampling with $p=0.95$.
  • Figure 5: Fraction of outputs for which Algorithm \ref{alg:heuristic} finds a longer plausible tokenization. The figure shows the fraction of outputs generated by different LLMs to $600$ prompts from the LMSYS Chatbot Arena platform for which Algorithm \ref{alg:heuristic} finds a longer plausible tokenization under top-$p$ sampling, for different values of $m$, $p$ and temperature. We repeat each experiment $5$ times to calculate $90$% confidence intervals.
  • ...and 7 more figures
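The pricing conversion described in the caption of Figure 4 can be sketched as follows. The sketch sets $r_c = r_o\cdot\texttt{tpc}$ and evaluates $\rho(\mathbf{t})=1 - c_\text{gen}(\mathbf{t})/r(\mathbf{t})$ per output. It assumes, purely for illustration, that the generation cost is a constant `cost_per_token` times the number of tokens. The function names, toy outputs, and numeric values are hypothetical.

```python
def tokens_per_character(outputs):
    """Average, over outputs, of (number of tokens) / (number of characters)."""
    ratios = [len(tokens) / len("".join(tokens)) for tokens in outputs]
    return sum(ratios) / len(ratios)


def per_character_price(price_per_token, tpc):
    """r_c = r_o * tpc, as in the caption of Figure 4."""
    return price_per_token * tpc


def profit_margin(tokens, price_per_char, cost_per_token):
    """rho(t) = 1 - c_gen(t) / r(t) under pay-per-character pricing.

    Assumption: c_gen(t) = cost_per_token * number of tokens.
    """
    revenue = price_per_char * len("".join(tokens))
    cost = cost_per_token * len(tokens)
    return 1 - cost / revenue


# Hypothetical calibration outputs, each given as a list of tokens.
calibration = [["ab", "cd"], ["abcd"]]
tpc = tokens_per_character(calibration)               # (0.5 + 0.25) / 2
r_c = per_character_price(price_per_token=8, tpc=tpc)

# Margin for a new output made of short tokens (more tokens per character).
margin_short_tokens = profit_margin(["ab", "cd"], r_c, cost_per_token=3)
# Margin for an output made of one long token (fewer tokens per character).
margin_long_token = profit_margin(["abcd"], r_c, cost_per_token=3)
```

The two margins differ because outputs with more tokens per character cost more to generate for the same per-character revenue, which matches the abstract's remark that the margin varies across tokens under pay-per-character pricing.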

Theorems & Definitions (7)

  • Definition 1: Pricing mechanism
  • Definition 2: Pay-per-token
  • Theorem 3
  • Definition 4
  • Proposition 5
  • Theorem 6
  • Corollary 7