CoIn: Counting the Invisible Reasoning Tokens in Commercial Opaque LLM APIs

Guoheng Sun, Ziyao Wang, Bowei Tian, Meng Liu, Zheyu Shen, Shwai He, Yexiao He, Wanghao Ye, Yiting Wang, Ang Li

TL;DR

CoIn addresses the transparency gap in commercial opaque LLM APIs by auditing the hidden reasoning tokens that users are billed for. It combines hash-tree-based Token Quantity Verification with embedding-based Semantic Validity Verification to detect both naive and adaptive token inflation without exposing proprietary content. The framework achieves a detection success rate of up to 94.7% against adaptive attacks and generalizes to unseen domains, which suggests practical potential for third-party auditing and billing transparency. The work lays a foundation for verifiable inference services in opaque LLM ecosystems and releases its dataset and code for reproducibility.

Abstract

As post-training techniques evolve, large language models (LLMs) are increasingly augmented with structured multi-step reasoning abilities, often optimized through reinforcement learning. These reasoning-enhanced models outperform standard LLMs on complex tasks and now underpin many commercial LLM APIs. However, to protect proprietary behavior and reduce verbosity, providers typically conceal the reasoning traces and return only the final answer. This opacity introduces a critical transparency gap: users are billed for invisible reasoning tokens, which often account for the majority of the cost, yet have no means to verify their authenticity. This opens the door to token count inflation, where providers may overreport token usage or inject synthetic, low-effort tokens to inflate charges. To address this issue, we propose CoIn, a verification framework that audits both the quantity and semantic validity of hidden tokens. CoIn constructs a verifiable hash tree from token embedding fingerprints to check token counts, and uses embedding-based relevance matching to detect fabricated reasoning content. Experiments demonstrate that CoIn, when deployed as a trusted third-party auditor, can effectively detect token count inflation with a success rate of up to 94.7%, demonstrating a strong ability to restore billing transparency in opaque LLM services. The dataset and code are available at https://github.com/CASE-Lab-UMD/LLM-Auditing-CoIn.
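
To make the hash-tree idea in the abstract concrete, the following is a minimal sketch of a Merkle tree built over per-token fingerprints, with an inclusion proof that an auditor could check against a published root. It is not the authors' implementation. The fingerprint scheme (SHA-256 over packed embedding values), the `b"leaf"`/`b"node"`/`b"root"` domain prefixes, the odd-level padding rule, and every function name (`fingerprint`, `build_levels`, `prove`, `verify`, `commitment`) are assumptions made for illustration.

```python
import hashlib
import struct
from typing import List, Tuple


def fingerprint(embedding: List[float]) -> bytes:
    """Hash one token embedding into a fixed-size leaf (assumed scheme)."""
    packed = struct.pack(f"<{len(embedding)}d", *embedding)
    return hashlib.sha256(b"leaf" + packed).digest()


def _parent(left: bytes, right: bytes) -> bytes:
    """Hash two child nodes into their parent node."""
    return hashlib.sha256(b"node" + left + right).digest()


def build_levels(leaves: List[bytes]) -> List[List[bytes]]:
    """Return all tree levels, from the leaves up to the single root."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2 == 1:
            # Assumed padding rule: pair the last node with itself.
            cur = cur + [cur[-1]]
        levels.append([_parent(cur[i], cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels


def prove(levels: List[List[bytes]], index: int) -> List[Tuple[bytes, bool]]:
    """Collect (sibling_hash, sibling_is_left) pairs from leaf to root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling >= len(level):
            sibling = index  # last node of an odd level was paired with itself
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(leaf: bytes, proof: List[Tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the root from one leaf and its proof, then compare."""
    node = leaf
    for sibling, sibling_is_left in proof:
        node = _parent(sibling, node) if sibling_is_left else _parent(node, sibling)
    return node == root


def commitment(root: bytes, token_count: int) -> bytes:
    """Bind the claimed token count to the root (assumed, for count auditing)."""
    return hashlib.sha256(b"root" + struct.pack("<Q", token_count) + root).digest()


# Toy usage: five made-up two-dimensional "embeddings".
leaves = [fingerprint([float(i), 1.0]) for i in range(5)]
levels = build_levels(leaves)
root = levels[-1][0]
published = commitment(root, len(leaves))
```

In this sketch the provider would publish `published` alongside the billed token count, and the auditor would request proofs for a sample of token positions and check each one with `verify`. Binding the count into the commitment is one way to keep the duplicate-last-node padding from making two different leaf counts map to the same published value.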

Paper Structure

This paper contains 27 sections, 2 equations, 12 figures, 5 tables, 4 algorithms.

Figures (12)

  • Figure 1: Ratio of reasoning tokens to answer tokens across datasets and deployed APIs. (a) Token ratios on the OpenR1-Math dataset across different OpenAI reasoning models. (b) Token ratios of DeepSeek-R1 across various reasoning datasets. In both cases, the number of reasoning tokens often exceeds the number of answer tokens by an order of magnitude or more.
  • Figure 2: CoIn Framework.
  • Figure 3: Performance of CoIn across different inflation methods and verifiers. The red lines and the blue lines represent the DSR of the rule-based verifier and the learning-based verifier, respectively. $\gamma$
  • Figure 4: Impact of threshold $\tau$ on DSR.
  • Figure 5: Merkle Tree Construction Time with Fluctuation Range.
  • ...and 7 more figures