MESS+: Dynamically Learned Inference-Time LLM Routing in Model Zoos with Service Level Guarantees

Herbert Woisetschläger, Ryan Zhang, Shiqiang Wang, Hans-Arno Jacobsen

TL;DR

MESS+ addresses the challenge of cost-effective LLM routing under service-level guarantees in open-weight model zoos. It combines online learning of request satisfaction with a Lyapunov drift-plus-penalty framework to select models per request, ensuring SLA compliance while minimizing energy/cost via a per-request optimization driven by predictions $\hat{s}_{m,t}$ and a virtual queue $Q_t$. Theoretical analysis provides bounds linking SLA satisfaction and cost optimality, and empirical evaluations show about $2\times$ cost reductions versus baselines across diverse benchmarks, including large zoos and non-stationary settings. The approach offers a practical, scalable mechanism for production endpoints to balance quality of service with energy efficiency without requiring offline preference datasets.

Abstract

Open-weight large language model (LLM) zoos provide access to numerous high-quality models, but selecting the appropriate model for specific tasks remains challenging and requires technical expertise. Most users simply want factually correct, safe, and satisfying responses without concerning themselves with model technicalities, while inference service providers prioritize minimizing operating costs. These competing interests are typically mediated through service level agreements (SLAs) that guarantee minimum service quality. We introduce MESS+, a stochastic optimization algorithm for cost-optimal LLM request routing while providing rigorous SLA compliance guarantees. MESS+ learns request satisfaction probabilities of LLMs in real-time as users interact with the system, based on which model selection decisions are made by solving a per-request optimization problem. Our algorithm includes a novel combination of virtual queues and request satisfaction prediction, along with a theoretical analysis of cost optimality and constraint satisfaction. Across a wide range of state-of-the-art LLM benchmarks, MESS+ achieves an average of $2\times$ cost savings compared to existing LLM routing techniques.
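The per-request decision described above can be illustrated with a minimal sketch of a textbook drift-plus-penalty rule. This is an assumption-laden illustration, not the authors' implementation: the scoring form $V \cdot E_m + Q_t(\alpha - \hat{s}_{m,t})$, the queue update $Q_{t+1} = \max\{0, Q_t + \alpha - s_{m,t}\}$, and the names `select_model`, `update_queue`, and `simulate` are hypothetical, and the learned predictor is replaced by fixed satisfaction probabilities.

```python
import random


def select_model(costs, s_hat, Q, V, alpha):
    """Pick the model index minimizing V * cost + Q * (alpha - s_hat).

    Assumed standard drift-plus-penalty form; V trades cost against
    how quickly the SLA target alpha is enforced.
    """
    scores = [V * e + Q * (alpha - s) for e, s in zip(costs, s_hat)]
    return min(range(len(costs)), key=scores.__getitem__)


def update_queue(Q, alpha, satisfied):
    """Virtual queue grows when a request falls short of the target alpha."""
    return max(0.0, Q + alpha - satisfied)


def simulate(costs, true_sat, alpha, V, steps, seed=0):
    """Route `steps` requests and return (mean satisfaction, mean cost).

    Assumption: the predictor is perfect, so s_hat equals the true
    per-model satisfaction probability. MESS+ learns it online instead.
    """
    rng = random.Random(seed)
    Q = 0.0
    total_cost = 0.0
    total_sat = 0
    for _ in range(steps):
        m = select_model(costs, true_sat, Q, V, alpha)
        satisfied = 1 if rng.random() < true_sat[m] else 0
        Q = update_queue(Q, alpha, satisfied)
        total_cost += costs[m]
        total_sat += satisfied
    return total_sat / steps, total_cost / steps


if __name__ == "__main__":
    # Hypothetical three-model zoo: small, medium, large.
    avg_sat, avg_cost = simulate(
        costs=[1.0, 5.0, 20.0],
        true_sat=[0.5, 0.7, 0.9],
        alpha=0.8,
        V=1.0,
        steps=20000,
    )
    print(f"mean satisfaction={avg_sat:.3f}, mean cost={avg_cost:.2f}")
```

With an empty queue the rule picks the cheapest model. As unmet requests accumulate, $Q_t$ grows and shifts the choice toward models with higher predicted satisfaction. A larger `V` weights cost more heavily, so the queue must grow further before the rule switches to a more expensive model.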

Paper Structure

This paper contains 26 sections, 6 theorems, 49 equations, 7 figures, 14 tables.

Key Result

Theorem 1

For any $t \geq 1$, we have the following upper bounds on the virtual queue length: where $\Delta_E := E_{\max} - E_{\min}$, in which $E_{\max}$ and $E_{\min}$ are the maximum and minimum operating costs of any model, respectively.

Figures (7)

  • Figure 1: OpenLLM-Leaderboard performance comparison of popular LLM families. Each family typically consists of a minimum of three models with distinct capabilities and cost characteristics.
  • Figure 2: Model SElection with Cost-optimal Service-level GuaranteeS (MESS+)
  • Figure 3: We run several experiments on the Winogrande benchmark with varying $\alpha$ and $V$ configurations to show the request satisfaction and cost dynamics over time. With MESS+, the average request satisfaction rate always converges toward $\alpha$. We further report the first step at which the highest $V$ value satisfies our SLA requirement. Other benchmarks are in the appendix.
  • Figure 4: Predictor training performance, averaged across all 8 benchmarks. We control the exploration probability of MESS+ with $c$. Our predictor learns effectively with a small $c$ already.
  • Figure C.1: Full overview of predictor training cost across all benchmarks used in our paper.
  • ...and 2 more figures

Theorems & Definitions (10)

  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Lemma A.1
  • proof
  • Lemma A.2
  • proof
  • Lemma A.3
  • proof
  • proof: Proof of the optimality bound theorem