Auditing Pay-Per-Token in Large Language Models

Ander Artola Velasco, Stratis Tsirtsis, Manuel Gomez-Rodriguez

TL;DR

This work addresses the economic misalignment created by pay-per-token pricing for LLMs by formalizing the detection of token misreporting as a sequential auditing problem. It introduces a martingale-based auditing framework in which a trusted auditor with access to next-token probabilities checks reported token counts against model outputs, together with an unbiased estimator of the average encoding length. The main contributions are (i) a rigorous sequential hypothesis test with performance guarantees, (ii) an unbiased estimator of the average length of the token sequences that encode a given string, and (iii) empirical validation showing that misreporting is detected after fewer than roughly 70 reported outputs while the probability of falsely flagging a faithful provider stays below $\alpha = 0.05$. The framework strengthens user trust in LLM-as-a-service by providing provable detection of misreporting across a range of misreporting policies and model families.

Abstract

Millions of users rely on a market of cloud-based services to obtain access to state-of-the-art large language models. However, it has been very recently shown that the de facto pay-per-token pricing mechanism used by providers creates a financial incentive for them to strategize and misreport the (number of) tokens a model used to generate an output. In this paper, we develop an auditing framework based on martingale theory that enables a trusted third-party auditor who sequentially queries a provider to detect token misreporting. Crucially, we show that our framework is guaranteed to always detect token misreporting, regardless of the provider's (mis-)reporting policy, and not falsely flag a faithful provider as unfaithful with high probability. To validate our auditing framework, we conduct experiments across a wide range of (mis-)reporting policies using several large language models from the $\texttt{Llama}$, $\texttt{Gemma}$ and $\texttt{Ministral}$ families, and input prompts from a popular crowdsourced benchmarking platform. The results show that our framework detects an unfaithful provider after observing fewer than $\sim 70$ reported outputs, while maintaining the probability of falsely flagging a faithful provider below $\alpha = 0.05$.
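
To make the idea of a sequential, martingale-based audit concrete, the following is a minimal sketch of a generic betting-style test martingale. It is not the paper's definition of $M_i$, which this summary does not reproduce. The function name `audit`, the argument `increments`, and the assumption that each increment is a normalized discrepancy in $[-1, 1]$ with non-positive conditional mean for a faithful provider are all hypothetical. Only the threshold $1/\alpha$ and the default values $\lambda = 0.07$ and $\alpha = 0.05$ are taken from the figure captions.

```python
from typing import Iterable, Optional


def audit(increments: Iterable[float], lam: float = 0.07,
          alpha: float = 0.05) -> Optional[int]:
    """Run a generic betting-style sequential test (illustrative only).

    increments: hypothetical per-query discrepancies, assumed to be scaled
        to [-1, 1] and to have non-positive conditional mean when the
        provider is faithful.
    lam: bet size in (0, 1); keeps every multiplicative factor positive.
    alpha: bound on the probability of flagging a faithful provider.

    Returns the 1-based index of the first query at which the test process
    exceeds 1/alpha, or None if the provider is never flagged.
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie in (0, 1)")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")

    threshold = 1.0 / alpha
    m = 1.0  # the test process starts at M_0 = 1
    for i, x in enumerate(increments, start=1):
        if not -1.0 <= x <= 1.0:
            raise ValueError("increments are assumed to lie in [-1, 1]")
        m *= 1.0 + lam * x  # multiplicative update of the test process
        if m > threshold:
            return i  # first time the process crosses 1/alpha
    return None
```

Under these assumptions, a stream of zero discrepancies never raises a flag, while a persistent positive discrepancy makes the process grow geometrically until it crosses $1/\alpha$.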

Paper Structure

This paper contains 21 sections, 4 theorems, 46 equations, 5 figures, 1 table, 4 algorithms.

Key Result

Proposition 1

Let $K \sim P^{K}$ and $R_{k}$ be defined by Eq. \ref{eq:rk} for each $k \in \{0, \dots, K\}$. Then, it holds that:

Figures (5)

  • Figure 1: Auditing faithful providers. The panels show realizations of the test process $M$ for three (simulated) faithful providers, each serving a different large language model. In each realization, we sequentially query the provider using prompts picked uniformly at random from the LMSYS Chatbot Arena dataset, and compute $M_i$ using Eq. \ref{eq:martignale-definition} with $\lambda=0.07, 0.13$ and $0.19$, respectively. In all panels, the dashed line illustrates the threshold $1/\alpha$ needed to flag a provider and, for clarity, we display $30$ realizations randomly sampled from a total of $150$. Moreover, we set the false positive rate bound to $\alpha = 0.05$ and the temperature of the models to $1$. Refer to Appendix \ref{app:additional-experimental-results} for qualitatively similar results using other temperature values.
  • Figure 2: Auditing an unfaithful provider who serves Llama-3.2-1B-Instruct. The two panels show realizations of the test process $M$ (top) and the distribution of detection times $\tau=\inf \{i\, \colon\, M_i >1/\alpha \}$ (bottom) when the provider uses, respectively, random and heuristic policies $\pi$ of varying intensity $\mathcal{I}(\pi)$. In each realization, we sequentially query the provider using prompts picked uniformly at random from the LMSYS Chatbot Arena dataset, and compute $M_i$ using Eq. \ref{eq:martignale-definition} with $\lambda=0.07$. In each panel, the three different intensity values correspond to policies $\pi$ parameterized by $m=1,2,3$, with higher values of $m$ leading to higher (darker) intensities and, for each $m$, we show $30$ realizations. In all panels, we set the false positive rate bound to $\alpha = 0.05$ and the temperature of the models to $1$. Refer to Appendix \ref{app:additional-experimental-results} for qualitatively similar results using other models and temperature values.
  • Figure 3: Auditing faithful providers. The panels show realizations of the test process $M$ for three (simulated) faithful providers, each serving a different large language model, across different temperature values used during generation and auditing. In each realization, we sequentially query the provider using prompts picked uniformly at random from the LMSYS Chatbot Arena dataset, and compute $M_i$ using Eq. \ref{eq:martignale-definition} with $\lambda=0.07, 0.13$ and $0.19$ for temperature $1.0$, $\lambda=0.10, 0.11$ and $0.10$ for temperature $1.0$, and $\lambda=0.10, 0.10$ and $0.19$ for temperature $1.0$, for Llama-3.2-1B-Instruct, Ministral-8B-Instruct-2410 and Gemma-3-1B-It, respectively. In all panels, the dashed line illustrates the threshold $1/\alpha$ needed to flag a provider and, for clarity, we display $30$ realizations randomly sampled from a total of $150$. Moreover, we set the false positive rate bound to $\alpha = 0.05$.
  • Figure 4: Auditing an unfaithful provider who misreports using Algorithm \ref{alg:random}. The panels show realizations of the test process $M$ (top) and the distribution of detection times $\tau=\inf \{i\, \colon\, M_i >1/\alpha \}$ (bottom) when the provider uses random policies $\pi$ of varying intensity $\mathcal{I}(\pi)$, across different models served and temperature values. In each realization, we sequentially query the provider using prompts picked uniformly at random from the LMSYS Chatbot Arena dataset, and compute $M_i$ using Eq. \ref{eq:martignale-definition} with $\lambda=0.07, 0.13$ and $0.19$ for temperature $1.0$, $\lambda=0.10, 0.11$ and $0.10$ for temperature $1.0$, and $\lambda=0.10, 0.10$ and $0.19$ for temperature $1.0$, for Llama-3.2-1B-Instruct, Ministral-8B-Instruct-2410 and Gemma-3-1B-It, respectively. In each panel, the three different intensity values correspond to policies $\pi$ parameterized by $m=1,2,3$, with higher values of $m$ leading to higher (darker) intensities, and, for each $m$, we show $30$ realizations. In all panels, we set the false positive rate bound to $\alpha = 0.05$.
  • Figure 5: Auditing an unfaithful provider who misreports using Algorithm \ref{alg:heuristic}. The panels show realizations of the test process $M$ (top) and the distribution of detection times $\tau=\inf \{i\, \colon\, M_i >1/\alpha \}$ (bottom) when the provider uses random policies $\pi$ of varying intensity $\mathcal{I}(\pi)$, across different models served and temperature values. In each realization, we sequentially query the provider using prompts picked uniformly at random from the LMSYS Chatbot Arena dataset, and compute $M_i$ using Eq. \ref{eq:martignale-definition} with $\lambda=0.07, 0.13$ and $0.19$ for temperature $1.0$, $\lambda=0.10, 0.11$ and $0.10$ for temperature $1.0$, and $\lambda=0.10, 0.10$ and $0.19$ for temperature $1.0$, for Llama-3.2-1B-Instruct, Ministral-8B-Instruct-2410 and Gemma-3-1B-It, respectively. In each panel, for clarity, we show $20$ randomly sampled realizations. In all panels, we set the false positive rate bound to $\alpha = 0.05$.
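
The captions above flag a provider once the test process exceeds $1/\alpha$ and describe $\alpha$ as a bound on the false positive rate. One standard way to obtain such a bound is Ville's inequality. The statement below is a sketch that assumes $M$ is a nonnegative supermartingale with $M_0 = 1$ whenever the provider is faithful; this summary does not reproduce the paper's definition of $M$, so that assumption is not confirmed here.

$$
\Pr\left(\tau < \infty\right)
= \Pr\left(\exists\, i \colon M_i > 1/\alpha\right)
\le \Pr\left(\sup_{i \ge 0} M_i \ge 1/\alpha\right)
\le \alpha \, \mathbb{E}[M_0]
= \alpha .
$$

With $\alpha = 0.05$ the threshold is $1/\alpha = 20$, so under this assumption a faithful provider would be flagged at any point during the audit with probability at most $0.05$, no matter how many queries are issued.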

Theorems & Definitions (4)

  • Proposition 1
  • Proposition 2
  • Theorem 3
  • Theorem 4