ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving

Annette Taberner-Miller

Abstract

Production LLM serving often relies on multi-model portfolios spanning a ~530x cost range, where routing decisions trade off quality against cost. This trade-off is non-stationary: providers revise pricing, model quality can regress silently, and new models must be integrated without downtime. We present ParetoBandit, an open-source adaptive router built on cost-aware contextual bandits that is the first to simultaneously enforce dollar-denominated budgets, adapt online to such shifts, and onboard new models at runtime. ParetoBandit closes these gaps through three mechanisms. An online primal-dual budget pacer enforces a per-request cost ceiling over an open-ended stream, replacing offline penalty tuning with closed-loop control. Geometric forgetting on sufficient statistics enables rapid adaptation to price and quality shifts while bootstrapping from offline priors. A hot-swap registry lets operators add or remove models at runtime, with a brief forced-exploration phase for each newcomer, after which UCB selection discovers its quality-cost niche from live traffic alone. We evaluate ParetoBandit across four deployment scenarios on 1,824 prompts routed through a three-model portfolio. Across seven budget ceilings, mean per-request cost never exceeds the target by more than 0.4%. When conditions shift, the system adapts: an order-of-magnitude price cut on the costliest model yields up to +0.071 quality lift, and a silent quality regression is detected and rerouted within budget. A cold-started model reaches meaningful adoption within ~142 steps without breaching the cost ceiling. The router discriminates rather than blindly adopting: expensive models are budget-gated and low-quality models are rejected after bounded exploration. End-to-end routing latency is 9.8 ms on CPU (less than 0.4% of typical inference time), with the routing decision itself taking just 22.5 µs.
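To make the closed-loop budget control concrete, the following is a minimal illustrative sketch of a primal-dual pacer, not the ParetoBandit implementation. It assumes a generic projected dual-ascent update on a single multiplier `lam`, with costs normalized by the budget. The class and function names (`BudgetPacer`, `route`, `simulate`), the step size, and the two-model portfolio with fixed quality estimates are all hypothetical; a real router would replace the fixed estimates with contextual bandit scores.

```python
from dataclasses import dataclass


@dataclass
class BudgetPacer:
    """Hypothetical primal-dual pacer: one dual variable prices cost."""
    budget: float            # target mean cost per request (dollars)
    step_size: float = 0.05  # dual learning rate (assumed value)
    lam: float = 0.0         # dual variable, kept non-negative

    def penalized_score(self, quality: float, cost: float) -> float:
        # Lagrangian-style score: quality minus the current price of cost.
        return quality - self.lam * cost / self.budget

    def update(self, realized_cost: float) -> None:
        # Projected dual ascent: lam rises after overspending and
        # falls (down to zero) after underspending.
        drift = realized_cost / self.budget - 1.0
        self.lam = max(0.0, self.lam + self.step_size * drift)


def route(pacer: BudgetPacer, estimates: dict) -> str:
    # estimates maps model name -> (quality estimate, cost per request).
    return max(estimates, key=lambda m: pacer.penalized_score(*estimates[m]))


def simulate(budget: float, steps: int) -> float:
    """Route `steps` requests and return the realized mean cost."""
    # Assumed toy portfolio; the numbers are illustrative only.
    estimates = {"cheap": (0.6, 0.001), "premium": (0.9, 0.010)}
    pacer = BudgetPacer(budget=budget)
    total = 0.0
    for _ in range(steps):
        chosen = route(pacer, estimates)
        cost = estimates[chosen][1]
        total += cost
        pacer.update(cost)  # feedback closes the loop
    return total / steps


if __name__ == "__main__":
    print(simulate(budget=0.004, steps=3000))
```

In this toy setting the multiplier oscillates around the value at which the two models tie, so the router mixes them in the proportion that keeps the mean cost close to the ceiling. When the budget is loose enough that the premium model is always affordable, `lam` stays at zero and the pacer has no effect on routing.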

Paper Structure

This paper contains 88 sections, 10 equations, 15 figures, 12 tables, 1 algorithm.

Figures (15)

  • Figure 1: Quality--cost Pareto frontier under stationary conditions ($K{=}3$, 20 seeds). (a) ParetoBandit with BudgetPacer (blue curve) traces a continuous quality--cost frontier by accepting a dollar budget ceiling. Fixed single-model baselines shown as stars. (b) Budget compliance: realized cost vs. ceiling. Green band marks $\pm5\%$. (c) Model allocation shifts from Llama-dominant at tight budgets to Gemini-heavy at loose budgets.
  • Figure 2: Adaptation dynamics under cost drift ($K{=}3$, 20 seeds, 95% bootstrap CI). Normal pricing $\to$ Gemini price drop at prompt $608$ $\to$ price restored at prompt $2{\times}608$. (a) Gemini-Pro selection fraction: all budget tiers surge toward Gemini during the price drop, then revert when pricing is restored. (b) Windowed mean reward: all conditions improve during Phase 2 as Gemini becomes budget-accessible. (c) Windowed average cost per request (dotted = budget ceilings): ParetoBandit tracks the ceiling in all three phases; the Forgetting Bandit overshoots from Phase 1 onward.
  • Figure 3: Silent quality degradation dynamics (20 seeds, 95% bootstrap CI). Mistral-Large degrades at prompt $608$ and recovers at $2{\times}608$; cost is unchanged. (a) Gemini-Pro selection fraction rises modestly under budget constraints and sharply (${\sim}17$ pp) without one. (b) Windowed reward: loose and unconstrained conditions recover fully in Phase 3; tighter budgets are still converging at the horizon limit (Appendix \ref{appendix:recovery_limit}). (c) Cost per request (dotted lines = ceilings): ParetoBandit holds compliance throughout, confirming that the degradation is invisible to the cost signal.
  • Figure 4: Model onboarding ($K{=}3 \to K{=}4$; 20 seeds, 95% bootstrap CI). Flash windowed selection fraction across four budget levels after cold-start addition. (a) Good & cheap: Flash is adopted at all budgets. (b) Good & expensive: the BudgetPacer suppresses Flash under tight budgets. (c) Bad & cheap: the bandit correctly rejects Flash in every seed.
  • Figure 5: Budget--quality trade-off during model onboarding (Good & Cheap; 20 seeds, 95% bootstrap CI). (a) Running cost per request (dotted = targets). Tight and moderate track their ceilings through the $K{=}3 \to K{=}4$ transition; loose settles below target as its constraint is only weakly binding. (b) Cumulative reward. Tight dips briefly at the Phase 2 boundary (20 forced-exploration pulls) before recovering; moderate rises as Flash fills a quality niche previously inaccessible at that cost tier.
  • ...and 10 more figures