Near-Optimal Online Deployment and Routing for Streaming LLMs

Shaoang Li, Jian Li

TL;DR

This work addresses online deployment and routing for streaming LLMs under a hard concurrency cap and long-term budget. It introduces StageRoute, a two-level algorithm that (i) makes stage-wise deployment decisions using optimistic reward bounds and conservative cost bounds, and (ii) performs per-query routing within the deployed set via a budget- and throughput-constrained LP. The authors prove a near-optimal regret bound of $\tilde{\mathcal{O}}(T^{2/3})$, with a matching lower bound, and validate the approach empirically on realistic workloads, showing StageRoute can closely track an oracle under tight budgets. By coupling stage-based commitment with real-time routing under budget and throughput constraints, this framework offers a practical, theoretically grounded solution for dynamic LLM ecosystems. The results have implications for scalable, cost-efficient deployment of evolving LLM portfolios in production systems.

Abstract

The rapid pace at which new large language models (LLMs) appear, and older ones become obsolete, forces providers to manage a streaming inventory under a strict concurrency cap and per-query cost budgets. We cast this as an online decision problem that couples stage-wise deployment (at fixed maintenance windows) with per-query routing among live models. We introduce StageRoute, a hierarchical algorithm that (i) optimistically selects up to $M_{\max}$ models for the next stage using reward upper-confidence and cost lower-confidence bounds, and (ii) routes each incoming query by solving a budget- and throughput-constrained bandit subproblem over the deployed set. We prove a regret of $\tilde{\mathcal{O}}(T^{2/3})$ with a matching lower bound, establishing near-optimality, and validate the theory empirically: StageRoute tracks a strong oracle under tight budgets across diverse workloads.
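The stage-wise deployment step described above can be illustrated with a minimal sketch. It is a simplified stand-in, not the authors' procedure: the Hoeffding-style radius `sqrt(gamma / pulls)`, the `ModelStats` record, and the rule of ranking the pool by optimistic reward (ties broken by optimistic cost) are all assumptions made for illustration, and the paper's actual selection rule and confidence bounds may differ.

```python
import math
from dataclasses import dataclass


@dataclass
class ModelStats:
    """Hypothetical per-model statistics gathered from past routing decisions."""
    name: str
    pulls: int          # number of queries routed to this model so far
    mean_reward: float  # empirical mean reward, assumed to lie in [0, 1]
    mean_cost: float    # empirical mean per-query cost, assumed non-negative


def radius(pulls: int, gamma: float) -> float:
    # Assumed Hoeffding-style confidence radius; a never-tried model
    # gets an infinite radius so that it is treated fully optimistically.
    return math.inf if pulls == 0 else math.sqrt(gamma / pulls)


def reward_ucb(m: ModelStats, gamma: float) -> float:
    # Upper confidence bound on reward, clipped to the assumed range [0, 1].
    return min(1.0, m.mean_reward + radius(m.pulls, gamma))


def cost_lcb(m: ModelStats, gamma: float) -> float:
    # Lower confidence bound on cost, clipped at zero.
    return max(0.0, m.mean_cost - radius(m.pulls, gamma))


def select_deployment(pool: list[ModelStats], m_max: int, gamma: float) -> list[str]:
    """Pick up to m_max models for the next stage.

    Simplification (assumption): rank by highest reward UCB, break ties by
    lowest cost LCB, then by name for determinism.
    """
    ranked = sorted(
        pool,
        key=lambda m: (-reward_ucb(m, gamma), cost_lcb(m, gamma), m.name),
    )
    return [m.name for m in ranked[:m_max]]


pool = [
    ModelStats("A", pulls=1000, mean_reward=0.9, mean_cost=0.002),
    ModelStats("B", pulls=1000, mean_reward=0.5, mean_cost=0.001),
    ModelStats("C", pulls=0, mean_reward=0.0, mean_cost=0.0),  # newly arrived
    ModelStats("D", pulls=1000, mean_reward=0.7, mean_cost=0.001),
]
deployed = select_deployment(pool, m_max=2, gamma=1.0)
```

In this toy pool the newly arrived model has no observations, so its optimistic reward is the maximum possible value and it is deployed alongside the best-known model. This is how optimism lets fresh models enter the deployed set without a separate exploration schedule.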

Paper Structure

This paper contains 30 sections, 14 theorems, 69 equations, 6 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

Consider StageRoute running for $T$ queries divided into $K$ stages, with a concurrency cap $M_{\max}$ and $N=|\mathcal{M}_T|$ total models arriving over time. Set the confidence parameter to $\gamma=\Theta(\log(NT/\delta))$ to obtain overall confidence $1-\delta$. Then the expected regret is bounded ... Choosing $K=\Theta(T^{1/3})$ and $M_{\max}=\Omega(N^{2/3})$ yields the $\tilde{\mathcal{O}}(T^{2/3})$ regret rate stated in the abstract.
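To get a feel for the parameter choices in Theorem 1, the short sketch below evaluates the three scalings for a concrete workload. The asymptotic notation hides constants that the text here does not specify, so the multipliers `c_K`, `c_M`, and `c_gamma` (all defaulting to 1) and the function name `theorem1_scalings` are hypothetical, and the resulting numbers are orders of magnitude rather than tuned settings.

```python
import math


def theorem1_scalings(T: int, N: int, delta: float,
                      c_K: float = 1.0, c_M: float = 1.0,
                      c_gamma: float = 1.0) -> dict:
    """Evaluate K ~ T^(1/3), M_max ~ N^(2/3), gamma ~ log(NT/delta).

    The constants c_K, c_M, c_gamma are placeholders (assumptions);
    the theorem only fixes the growth rates.
    """
    num_stages = max(1, round(c_K * T ** (1 / 3)))
    stage_length = math.ceil(T / num_stages)  # queries between deployment updates
    # A cap larger than the number of models ever released is pointless.
    m_max = min(N, math.ceil(c_M * N ** (2 / 3)))
    gamma = c_gamma * math.log(N * T / delta)
    return {
        "num_stages": num_stages,
        "stage_length": stage_length,
        "m_max": m_max,
        "gamma": gamma,
    }


params = theorem1_scalings(T=1_000_000, N=30, delta=0.05)
```

With one million queries this gives about one hundred stages of roughly ten thousand queries each. Fewer, longer stages reduce how often the deployment can react to new models, while more, shorter stages leave less data per stage to learn from, which is the trade-off that the $K=\Theta(T^{1/3})$ choice balances.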

Figures (6)

  • Figure 1: The StageRoute workflow. Newly released LLMs (green) continually enter the candidate pool. At each scheduled update point, StageRoute deploys up to $M_{\max}$ models (blue). Between updates, each query is routed among the current deployment (orange). This two-level loop assimilates fresh models, enforces cost/throughput constraints, and adapts routing in real time.
  • Figure 2: Comparison of decision heatmaps for StageRoute and the Oracle with $M_{\max}=5$, $b=0.001$, and update interval = 1000. Darker colors indicate higher selection probabilities.
  • Figure 3: Cumulative regret.
  • Figure 4: Performance-cost evolution of different algorithms.
  • Figure 5: Cumulative regret under varying hyperparameters. The default setting is $M_{\max}=5$, update interval = 1000 rounds, and $b=0.001$ (Figure \ref{fig:mmax5}).
  • ...and 1 more figure

Theorems & Definitions (26)

  • Theorem 1
  • Theorem 2
  • Lemma 1: DBLP:conf/stoc/KleinbergSU08, DBLP:journals/teco/BabaioffDKS15
  • Lemma 2: DBLP:journals/teco/BabaioffDKS15, adapted
  • Lemma 3: DBLP:journals/teco/BabaioffDKS15, adapted
  • proof
  • Definition 1: Optimal Performance within Deployed Set
  • Definition 2: Algorithm Performance
  • Lemma 4: Regret Decomposition with Time-Varying Benchmark
  • proof
  • ...and 16 more