The Price Reversal Phenomenon: When Cheaper Reasoning Models End Up Costing More

Lingjiao Chen, Chi Zhang, Yeye He, Ion Stoica, Matei Zaharia, James Zou

Abstract

Developers and consumers increasingly choose reasoning language models (RLMs) based on their listed API prices. However, how accurately do these prices reflect actual inference costs? We conduct the first systematic study of this question, evaluating 8 frontier RLMs across 9 diverse tasks covering competition math, science QA, code generation, and multi-domain reasoning. We uncover the pricing reversal phenomenon: in 21.8% of model-pair comparisons, the model with a lower listed price actually incurs a higher total cost, with reversal magnitude reaching up to 28x. For example, Gemini 3 Flash's listed price is 78% cheaper than GPT-5.2's, yet its actual cost across all tasks is 22% higher. We trace the root cause to vast heterogeneity in thinking token consumption: on the same query, one model may use 900% more thinking tokens than another. In fact, removing thinking token costs reduces ranking reversals by 70% and raises the rank correlation (Kendall's $\tau$) between price and cost rankings from 0.563 to 0.873. We further show that per-query cost prediction is fundamentally difficult: repeated runs of the same query yield thinking token variation up to 9.7x, establishing an irreducible noise floor for any predictor. Our findings demonstrate that listed API pricing is an unreliable proxy for actual cost, calling for cost-aware model selection and transparent per-request cost monitoring.
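
The two ranking statistics quoted in the abstract (the share of model pairs with a pricing reversal, and Kendall's tau between the price and cost rankings) can be illustrated with a minimal sketch. The model names, prices, and costs below are hypothetical placeholders, not values from the paper, and the function names are illustrative. The sketch assumes no ties (the tau-a variant); how the paper handles ties is not stated here.

```python
from itertools import combinations

def count_reversals(listed_price, actual_cost):
    """Return (reversals, pairs): a reversal is a model pair in which the
    model with the lower listed price has the higher actual cost."""
    reversals = 0
    pairs = 0
    for a, b in combinations(listed_price, 2):
        pairs += 1
        price_diff = listed_price[a] - listed_price[b]
        cost_diff = actual_cost[a] - actual_cost[b]
        # Opposite signs mean price order and cost order disagree.
        if price_diff * cost_diff < 0:
            reversals += 1
    return reversals, pairs

def kendall_tau(listed_price, actual_cost):
    """Kendall's tau-a, assuming no tied prices or costs:
    (concordant pairs - discordant pairs) / total pairs."""
    discordant, pairs = count_reversals(listed_price, actual_cost)
    concordant = pairs - discordant
    return (concordant - discordant) / pairs

# Hypothetical data: listed price per 1M tokens and measured workload cost.
listed_price = {"model_a": 1.0, "model_b": 2.0, "model_c": 4.0, "model_d": 8.0}
actual_cost = {"model_a": 30.0, "model_b": 10.0, "model_c": 40.0, "model_d": 50.0}

reversals, pairs = count_reversals(listed_price, actual_cost)
tau = kendall_tau(listed_price, actual_cost)
print(f"{reversals}/{pairs} pairs reversed, Kendall's tau = {tau:.3f}")
```

In this toy data only the pair (model_a, model_b) is reversed, so one of six pairs disagrees. A tau of 1 would mean listed prices rank models exactly as actual costs do.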

Paper Structure (32 sections, 2 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: The mismatch between AI models' listed prices and their actual costs. (a) On the same user workloads, models with lower listed prices may incur much higher expenses than models with higher listed prices. For example, Gemini 3 Flash's listed price ($3.5 per 1 million tokens) is 78% lower than GPT-5.2's ($15.75), but its actual cost ($643) is 22% higher than GPT-5.2's ($527). (b) This dramatically changes the cost ranking and poses a pressing challenge for cost-sensitive users. For example, a user might choose GPT-5 Mini over Claude Haiku 4.5 because of its lower listed price, only to find later that it is 43% more expensive on their workload.
  • Figure 2: The ranking inversion phenomenon. Listed price rankings systematically mismatch actual cost rankings, and actual cost rankings vary substantially across tasks. This suggests that comparing models by a fixed listed API price is misleading.
  • Figure 3: Cost and token consumption breakdown by token types. Thinking tokens dominate both token volume and total cost for most models, establishing them as the primary candidate for explaining pricing reversal.
  • Figure 4: Case study: on the same AIME problem, GPT-5.2 uses 562 thinking tokens while Gemini 3 Flash uses over 11,000, leading to a 2.5$\times$ higher actual cost for Gemini 3 Flash despite its lower API pricing. The mechanism of reversal is the enormous cross-model variance in thinking token consumption.
  • Figure 5: Ablation study: removing thinking token costs from actual cost computation. (a) Kendall's $\tau$ between listed price ranking and actual cost ranking increases substantially across all tasks. (b) The number of pairwise ranking reversals drops by 70% on average, confirming that thinking tokens are the primary cause of pricing reversal.
  • ...and 2 more figures
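
The percentages quoted in the Figure 1 caption can be checked from the dollar figures given there. The sketch below is only an arithmetic check; the helper name `percent_change` is illustrative, and the thinking-token ratio is a lower bound because the Figure 4 caption says only "over 11,000".

```python
def percent_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100

# Values quoted in the Figure 1 caption.
flash_price, gpt_price = 3.5, 15.75  # listed price, $ per 1M tokens
flash_cost, gpt_cost = 643, 527      # actual cost on the workloads, $

price_gap = percent_change(flash_price, gpt_price)  # negative: Flash is listed cheaper
cost_gap = percent_change(flash_cost, gpt_cost)     # positive: Flash costs more

# Values quoted in the Figure 4 caption (11,000 is a lower bound).
thinking_ratio = 11_000 / 562

print(f"listed price gap: {price_gap:.1f}%")
print(f"actual cost gap: {cost_gap:+.1f}%")
print(f"thinking token ratio: at least {thinking_ratio:.1f}x")
```

The listed price gap rounds to -78% and the actual cost gap to +22%, matching the caption.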