Near-Optimal Online Deployment and Routing for Streaming LLMs
Shaoang Li, Jian Li
TL;DR
This work addresses online deployment and routing for streaming LLMs under a hard concurrency cap and a long-term budget. It introduces StageRoute, a two-level algorithm that (i) makes stage-wise deployment decisions using optimistic reward bounds and conservative cost bounds, and (ii) performs per-query routing within the deployed set via a budget- and throughput-constrained LP. The authors prove a near-optimal regret bound of $\tilde{\mathcal{O}}(T^{2/3})$, with a matching lower bound, and validate the approach empirically on realistic workloads, showing that StageRoute closely tracks an oracle under tight budgets. By coupling stage-based commitment with real-time routing under budget and throughput constraints, the framework offers a practical, theoretically grounded solution for dynamic LLM ecosystems. The results bear on scalable, cost-efficient deployment of evolving LLM portfolios in production systems.
Abstract
The rapid pace at which new large language models (LLMs) appear, and older ones become obsolete, forces providers to manage a streaming inventory under a strict concurrency cap and per-query cost budgets. We cast this as an online decision problem that couples stage-wise deployment (at fixed maintenance windows) with per-query routing among live models. We introduce StageRoute, a hierarchical algorithm that (i) optimistically selects up to $M_{\max}$ models for the next stage using reward upper-confidence and cost lower-confidence bounds, and (ii) routes each incoming query by solving a budget- and throughput-constrained bandit subproblem over the deployed set. We prove a regret bound of $\tilde{\mathcal{O}}(T^{2/3})$ with a matching lower bound, establishing near-optimality, and validate the theory empirically: StageRoute tracks a strong oracle under tight budgets across diverse workloads.
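To make the deployment step concrete, the following is a minimal Python sketch of optimistic selection: each candidate model gets a reward upper-confidence bound and a cost lower-confidence bound, and at most $M_{\max}$ models are kept for the next stage. The abstract does not specify the exact confidence radius or selection rule, so the Hoeffding-style radius, the greedy ranking, and all names (`confidence_radius`, `optimistic_bounds`, `select_deployment`, `stats`) are assumptions made for illustration, not the authors' method. Rewards and costs are assumed to lie in $[0, 1]$, and the routing subproblem is not shown.

```python
import math


def confidence_radius(n_obs: int, t: int) -> float:
    """Hoeffding-style radius (an assumed form); maximal when a model is untried."""
    if n_obs == 0:
        return 1.0
    return math.sqrt(2.0 * math.log(max(t, 2)) / n_obs)


def optimistic_bounds(mean_reward: float, mean_cost: float, n_obs: int, t: int):
    """Return (reward UCB, cost LCB), both clipped to [0, 1]."""
    rad = confidence_radius(n_obs, t)
    reward_ucb = min(1.0, mean_reward + rad)
    cost_lcb = max(0.0, mean_cost - rad)
    return reward_ucb, cost_lcb


def select_deployment(stats: dict, t: int, m_max: int) -> list:
    """Pick up to m_max models for the next stage.

    stats maps a model name to (mean_reward, mean_cost, n_obs).
    Greedy rule (an assumption): highest reward UCB first,
    ties broken in favor of the lower cost LCB.
    """
    scored = []
    for name, (mean_reward, mean_cost, n_obs) in stats.items():
        reward_ucb, cost_lcb = optimistic_bounds(mean_reward, mean_cost, n_obs, t)
        scored.append((-reward_ucb, cost_lcb, name))
    scored.sort()
    return [name for _, _, name in scored[:m_max]]


# Hypothetical statistics: two observed models and one newly arrived model.
example_stats = {
    "model_a": (0.9, 0.5, 100),
    "model_b": (0.5, 0.2, 100),
    "model_new": (0.0, 0.0, 0),
}
deployed = select_deployment(example_stats, t=100, m_max=2)
```

In this sketch a model with no observations receives the widest confidence interval, so optimism gives newly arrived models a chance to be deployed and evaluated before they are ruled out.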
