Auditing Pay-Per-Token in Large Language Models
Ander Artola Velasco, Stratis Tsirtsis, Manuel Gomez-Rodriguez
TL;DR
This work addresses the economic misalignment created by pay-per-token pricing for LLMs by formalizing the detection of token misreporting as a sequential auditing problem. It introduces a martingale-based auditing framework in which a trusted auditor with access to next-token probabilities checks the reported token counts against the model outputs. The main contributions are (i) a sequential hypothesis test with performance guarantees, (ii) an unbiased estimator for the average length of the token sequences that encode a given string, and (iii) empirical validation showing detection of misreporting within about 70 reported outputs while keeping the probability of a false positive below $\alpha = 0.05$. The framework strengthens user trust in LLM-as-a-service by providing provable detection of misreporting across a range of misreporting policies and model families.
Abstract
Millions of users rely on a market of cloud-based services to obtain access to state-of-the-art large language models. However, it has recently been shown that the de facto pay-per-token pricing mechanism used by providers creates a financial incentive for them to strategize and misreport the (number of) tokens a model used to generate an output. In this paper, we develop an auditing framework based on martingale theory that enables a trusted third-party auditor who sequentially queries a provider to detect token misreporting. Crucially, we show that our framework is guaranteed to always detect token misreporting, regardless of the provider's (mis-)reporting policy, and that, with high probability, it does not falsely flag a faithful provider as unfaithful. To validate our auditing framework, we conduct experiments across a wide range of (mis-)reporting policies using several large language models from the $\texttt{Llama}$, $\texttt{Gemma}$ and $\texttt{Ministral}$ families, and input prompts from a popular crowdsourced benchmarking platform. The results show that our framework detects an unfaithful provider after observing fewer than $\sim 70$ reported outputs, while maintaining the probability of falsely flagging a faithful provider below $\alpha = 0.05$.
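To make the idea of a martingale-based sequential audit concrete, the following is a minimal sketch of a generic betting-style test martingale. It is not the construction used in the paper. It assumes that, for each audited output, the auditor can form a difference between the reported token count and an unbiased estimate of the faithful count, that this difference has conditional mean at most zero for a faithful provider, and that its absolute value is bounded by a known constant. The names `audit`, `diffs`, `bound` and `bet` are hypothetical. Under these assumptions the wealth process is a nonnegative supermartingale starting at 1, so by Ville's inequality it reaches $1/\alpha$ with probability at most $\alpha$ when the provider is faithful.

```python
def audit(diffs, alpha=0.05, bound=1.0, bet=0.5):
    """Sequentially audit a provider with a betting-style test martingale.

    diffs : iterable of (reported count - estimated faithful count), one per
            audited output; assumed to lie in [-bound, bound] and to have
            conditional mean <= 0 when the provider is faithful.
    alpha : target bound on the probability of flagging a faithful provider.
    bet   : fixed fraction of wealth wagered each round, in [0, 1).

    Returns the 1-based index of the output at which the provider is
    flagged, or None if the threshold is never reached.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    if bound <= 0.0:
        raise ValueError("bound must be positive")
    if not 0.0 <= bet < 1.0:
        raise ValueError("bet must be in [0, 1)")

    wealth = 1.0               # initial value of the test martingale
    threshold = 1.0 / alpha    # rejection level given by Ville's inequality

    for t, d in enumerate(diffs, start=1):
        if abs(d) > bound:
            # The guarantee relies on bounded differences (an assumption).
            raise ValueError("difference outside the assumed bound")
        # Each factor lies in (0, 2) and has conditional mean <= 1 under
        # the faithful hypothesis, so wealth is a nonnegative supermartingale.
        wealth *= 1.0 + bet * (d / bound)
        if wealth >= threshold:
            return t
    return None
```

For instance, if every difference equals the bound and `bet=0.5`, the wealth after $t$ outputs is $1.5^t$, which first exceeds $1/\alpha = 20$ at $t = 8$, so `audit([1.0] * 20)` returns `8`. A fixed `bet` keeps the sketch short; choosing it adaptively from past observations is a common refinement and preserves the guarantee as long as the choice depends only on earlier outputs.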
