Routing, Cascades, and User Choice for LLMs

Rafid Mahmood

TL;DR

The paper addresses how to route tasks across two LLMs under reactive user behavior by formulating a Stackelberg game between a cost-minimizing provider and a utility-maximizing user. A Markov routing model yields closed-form expressions for user utility $U_i(s,q)$ and provider cost $J_i(s,q)$, parameterized by per-pass costs $c_i$, latencies $t_i$, success probabilities $p_i$, and net values $\xi_i = V p_i - t_i$. The key finding is that the provider’s optimal routing is typically static with little benefit from cascading, though misalignment with user incentives can arise, and latency throttling can paradoxically reduce provider costs in some regimes. The results translate into threshold-based routing rules that guide practical LLM service design, clarifying when cascading, throttling, or fixed routing helps or harms user welfare and provider economics.

Abstract

To mitigate the trade-offs between performance and costs, LLM providers route user tasks to different models based on task difficulty and latency. We study the effect of LLM routing with respect to user behavior. We propose a game between an LLM provider with two models (standard and reasoning) and a user who can re-prompt or abandon tasks if the routed model cannot solve them. The user's goal is to maximize their utility minus the delay from using the model, while the provider minimizes the cost of servicing the user. We solve this Stackelberg game by fully characterizing the user best response and simplifying the provider problem. We observe that in nearly all cases, the optimal routing policy involves a static policy with no cascading that depends on the expected utility of the models to the user. Furthermore, we reveal a misalignment gap between the provider-optimal and user-preferred routes when the user's and provider's rankings of the models with respect to utility and cost differ. Finally, we demonstrate conditions for extreme misalignment where providers are incentivized to throttle the latency of the models to minimize their costs, consequently depressing user utility. The results yield simple threshold rules for single-provider, single-user interactions and clarify when routing, cascading, and throttling help or harm.

Paper Structure (19 sections, 13 theorems, 17 equations, 5 figures, 1 table)

Introduction
Related Literature
Main problem
Markov model of LLM routing with user response
User Problem
Provider Problem
Characterizing Provider and User Policies
User Best Response
Provider-optimal Policy
When are provider and user misaligned?
The risk of throttling latency
Conclusion
Protocol of LLM Usage
Helper Lemmas
Proofs of Main Results
...and 4 more sections

Key Result

Theorem 1

If the provider sets $i = 2$, then the user best response is $q^*(2, s) = \mathds{1} \left\{ \xi_2 < 0 \right\}$.
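As a minimal sketch of how the threshold in Theorem 1 could be evaluated, the snippet below computes the net value $\xi_i = V p_i - t_i$ from the TL;DR and applies the indicator $\mathds{1}\{\xi_2 < 0\}$. The function names and the numeric values are illustrative assumptions, not taken from the paper, and the snippet does not model what $q$ means beyond returning the indicator's value.

```python
def net_value(V: float, p: float, t: float) -> float:
    """Net value xi = V * p - t: task value times success probability, minus latency."""
    return V * p - t


def best_response_when_routed_to_model_2(xi_2: float) -> int:
    """Indicator from Theorem 1: returns 1 if xi_2 < 0, else 0."""
    return 1 if xi_2 < 0 else 0


# Hypothetical parameters for illustration only.
V = 10.0
xi_positive = net_value(V, p=0.9, t=2.0)  # value outweighs latency
xi_negative = net_value(V, p=0.5, t=6.0)  # latency outweighs value

q_positive = best_response_when_routed_to_model_2(xi_positive)
q_negative = best_response_when_routed_to_model_2(xi_negative)
print(q_positive, q_negative)
```

Under these assumed numbers the first model state has positive net value and the second has negative net value, so the indicator switches between the two cases.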

Figures (5)

Figure 1: Key guidelines for optimal routing in the face of reactive users. User behavior depends on the state of the two models in terms of providing utility versus the delay incurred from inference compute. Provider policies must follow different thresholding rules in each region.
Figure 2: Markov model of LLM provider-user interaction.
Figure 3: Heatmap of the user response when $i =1$ and $s = 0.25$. When both models are value-dominated (top right) or latency-dominated (bottom left), the user best response is static. When the models differ in their regime (top left and bottom right), the user response depends on $\xi_1, \xi_2$. Note that $q^* \in (0, 1)$ is in the interior only in certain regimes for $\xi_1 > 0 > \xi_2$.
Figure 4: Left: Heatmap of the provider-optimal policy for $\xi_1 < 0 < \xi_2$. Right: Heatmap of the provider-optimal policy for $\xi_1 > 0 > \xi_2$. For both plots, we hold $c_1 = 1$ constant and sweep the difference in cost-of-pass $c_1/p_1 - c_2/p_2$ as well as $P$. For most regimes, the optimal policy is either to route immediately to $M_1$ or to $M_2$ without any cascading. There exist only some regimes for $\xi_1 > 0 > \xi_2$ where the optimal policy involves probabilistic $s\in(0,1)$.
Figure 5: Left: Heatmap of the user misalignment gap for $\xi_1 < 0 < \xi_2$. Middle: Heatmap of the user misalignment gap for $\xi_1 > 0 > \xi_2$. Right: Heatmap of the effect of throttling on provider costs; the dashed line is the line $P=\min\{c_1/p_1, c_2/p_2\}$. For all plots, we hold $c_1 = 1$ constant and sweep the difference in cost-of-pass $c_1/p_1 - c_2/p_2$ as well as $P$.
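The captions for Figures 4 and 5 are parameterized by the cost-of-pass $c_i/p_i$, its difference across the two models, and the line $P=\min\{c_1/p_1, c_2/p_2\}$. The sketch below only computes those quantities; the helper name and the example values for $c_2$, $p_1$, and $p_2$ are hypothetical (only $c_1 = 1$ comes from the captions), and $P$ itself is not defined in this excerpt, so it is left uninterpreted.

```python
def cost_of_pass(c: float, p: float) -> float:
    """Cost-of-pass c / p: per-pass cost divided by success probability."""
    if not 0.0 < p <= 1.0:
        raise ValueError("success probability must lie in (0, 1]")
    return c / p


# c1 = 1 matches the figure captions; the remaining values are assumed.
c1, p1 = 1.0, 0.5
c2, p2 = 3.0, 0.75

cop_1 = cost_of_pass(c1, p1)
cop_2 = cost_of_pass(c2, p2)

# Horizontal-axis quantity swept in Figures 4 and 5.
cop_difference = cop_1 - cop_2

# Value of the dashed line P = min{c1/p1, c2/p2} in Figure 5 (right).
dashed_line_P = min(cop_1, cop_2)
print(cop_difference, dashed_line_P)
```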

Theorems & Definitions (26)

Theorem 1
Theorem 2
Theorem 3
Theorem 4
Theorem 5
Proposition 1
Proposition 2
Lemma 1
Proof of the lemma labeled lem:objectives_monotone
Lemma 2
...and 16 more
