Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives
Ander Artola Velasco, Stratis Tsirtsis, Nastaran Okati, Manuel Gomez-Rodriguez
TL;DR
This work analyzes pay-per-token pricing in LLM-as-a-service and shows that it creates a moral hazard: because the same output string can be tokenized in several ways, a provider can overcharge by misreporting the tokenization without altering the user-visible output. Using a principal-agent model, the authors prove that finding the longest plausible tokenization under common sampling schemes is NP-hard, propose a practical heuristic that can nonetheless misreport tokenizations profitably, and derive an incentive-compatible (IC) alternative: pay-per-character pricing, which charges by string length and removes any incentive to inflate token counts. They formally characterize IC pricing mechanisms, showing that pay-per-token pricing cannot be IC when multi-character tokens exist, and offer a way to preserve the average profit margin when switching to pay-per-character pricing, based on the token-to-character ratio. Experiments on Llama, Gemma, and Ministral models with LMSYS prompts demonstrate the vulnerability of pay-per-token pricing and the viability of pay-per-character pricing, pointing to a practical path toward stronger consumer protection and fairer pricing in LLM-as-a-service.
Abstract
State-of-the-art large language models require specialized hardware and substantial energy to operate. As a consequence, cloud-based services that provide access to large language models have become very popular. In these services, the price users pay for an output provided by a model depends on the number of tokens the model uses to generate it: they pay a fixed price per token. In this work, we show that this pricing mechanism creates a financial incentive for providers to strategize and misreport the (number of) tokens a model used to generate an output, and users cannot prove, or even know, whether a provider is overcharging them. However, we also show that, if an unfaithful provider is obliged to be transparent about the generative process used by the model, misreporting optimally without raising suspicion is hard. Nevertheless, as a proof-of-concept, we develop an efficient heuristic algorithm that allows providers to significantly overcharge users without raising suspicion. Crucially, we demonstrate that the cost of running the algorithm is lower than the additional revenue from overcharging users, highlighting the vulnerability of users under the current pay-per-token pricing mechanism. Further, we show that, to eliminate the financial incentive to strategize, a pricing mechanism must price tokens linearly on their character count. While this makes a provider's profit margin vary across tokens, we introduce a simple prescription under which the provider who adopts such an incentive-compatible pricing mechanism can maintain the average profit margin they had under the pay-per-token pricing mechanism. Along the way, to illustrate and complement our theoretical results, we conduct experiments with several large language models from the $\texttt{Llama}$, $\texttt{Gemma}$ and $\texttt{Ministral}$ families, and input prompts from the LMSYS Chatbot Arena platform.
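To make the pricing argument concrete, the following is a minimal toy sketch and not the authors' algorithm or code. It assumes two hypothetical tokenizations of the same string, an arbitrary price of 4 units per token, and a per-character price obtained by multiplying the per-token price by an average token-to-character ratio estimated from the honest tokenization. All function names are illustrative.

```python
# Toy illustration (hypothetical data and names, not taken from the paper's code).
# Two tokenizations that decode to the same user-visible string.
honest = ["Hello", " world"]            # 2 tokens
padded = ["Hel", "lo", " wor", "ld"]    # 4 tokens, same characters


def decode(tokens):
    """Concatenate tokens into the string the user actually sees."""
    return "".join(tokens)


def pay_per_token(tokens, price_per_token):
    """Charge depends on how many tokens are reported."""
    return len(tokens) * price_per_token


def pay_per_character(tokens, price_per_char):
    """Charge depends only on the total character count of the output."""
    return sum(len(t) for t in tokens) * price_per_char


def char_price_from_token_price(price_per_token, tokens_per_char):
    """Assumed conversion: scale the token price by the average
    token-to-character ratio so average revenue stays the same."""
    return price_per_token * tokens_per_char


price_per_token = 4  # arbitrary units, chosen for illustration

# Same visible output, different bills under pay-per-token.
same_output = decode(honest) == decode(padded)
honest_bill = pay_per_token(honest, price_per_token)
padded_bill = pay_per_token(padded, price_per_token)

# Estimate the ratio from the honest tokenization (an assumption of this sketch).
ratio = len(honest) / len(decode(honest))
price_per_char = char_price_from_token_price(price_per_token, ratio)

# Under pay-per-character, both tokenizations cost the same.
honest_char_bill = pay_per_character(honest, price_per_char)
padded_char_bill = pay_per_character(padded, price_per_char)

print(same_output, honest_bill, padded_bill)
print(honest_char_bill == padded_char_bill)
```

In this sketch the padded tokenization doubles the pay-per-token bill while the user sees an identical string, whereas the pay-per-character bill does not depend on how the string was split. In practice the ratio would be estimated over many outputs rather than a single one.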
