OmniRouter: Budget and Performance Controllable Multi-LLM Routing

Kai Mei, Wujiang Xu, Minghao Guo, Shuhang Lin, Yongfeng Zhang

TL;DR

OmniRouter addresses the problem of allocating queries across multiple LLMs under budget and performance constraints. It reframes routing as a constrained optimization problem and introduces a two-stage architecture: a retrieval-augmented predictor that estimates each model's capability and cost, and a Lagrangian dual-based optimizer that produces globally cost-efficient allocations. Experiments show up to 6.30% higher response accuracy and at least 10.15% lower cost than competitive router baselines, with strong controllability as the performance constraint α and the concurrency constraint L vary. The work offers a practical mechanism for budget-aware multi-LLM serving at a required performance level, and provides proofs of optimal assignment under the dual variables.
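As a reading aid, the constrained problem described above can be written roughly as follows. This is a sketch under assumptions, not the paper's exact formulation: the symbols $c_{ij}$ (predicted cost of model $j$ on query $i$), $p_{ij}$ (predicted performance), $x_{ij}$ (binary assignment), $N$ (number of queries), and $M$ (number of models) are introduced here for illustration, and treating $L$ as a per-model concurrency limit $L_j$ is likewise an assumption.

$$
\begin{aligned}
\min_{x}\quad & \sum_{i=1}^{N}\sum_{j=1}^{M} c_{ij}\,x_{ij} \\
\text{s.t.}\quad & \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} p_{ij}\,x_{ij} \ge \alpha, \\
& \sum_{i=1}^{N} x_{ij} \le L_j \quad \text{for all } j, \\
& \sum_{j=1}^{M} x_{ij} = 1 \quad \text{for all } i, \qquad x_{ij} \in \{0, 1\}.
\end{aligned}
$$

Under this reading, raising $\alpha$ tightens the quality requirement and lowering $L_j$ tightens the capacity of model $j$, which matches the two controls the summary refers to.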

Abstract

Large language models (LLMs) deliver superior performance but require substantial computational resources and operate with relatively low efficiency, while smaller models can efficiently handle simpler tasks with fewer resources. LLM routing is a crucial paradigm that dynamically selects the most suitable large language models from a pool of candidates to process diverse inputs, ensuring optimal resource utilization while maintaining response quality. Existing routing frameworks typically model this as a locally optimal decision-making problem, selecting the presumed best-fit LLM for each query individually, which overlooks global budget constraints, resulting in ineffective resource allocation. To tackle this problem, we introduce OmniRouter, a fundamentally controllable routing framework for multi-LLM serving. Instead of making per-query greedy choices, OmniRouter models the routing task as a constrained optimization problem, assigning models that minimize total cost while ensuring the required performance level. Specifically, a hybrid retrieval-augmented predictor is designed to predict the capabilities and costs of LLMs. After obtaining the predicted cost and performance, we utilize a constrained optimizer for cost-optimal assignments that employs Lagrangian dual decomposition with adaptive multipliers. It iteratively converges toward the globally optimal query-model allocation, dynamically balancing latency minimization against quality thresholds while adhering to heterogeneous capacity constraints. Experiments show that OmniRouter achieves up to 6.30% improvement in response accuracy while simultaneously reducing computational costs by at least 10.15% compared to competitive router baselines. The code and the dataset are available at https://github.com/dongyuanjushi/OmniRouter.
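The optimizer described in the abstract can be illustrated with a minimal Lagrangian-relaxation sketch. It is not the authors' implementation: the function name `route_dual`, the matrices `cost` and `perf`, the `capacity` vector, and the step-size schedule are all assumptions made for illustration. The idea shown is that, once the performance and capacity constraints are priced by multipliers, each query can pick its model independently, and the multipliers are then adjusted by projected subgradient steps.

```python
import numpy as np

def route_dual(cost, perf, alpha, capacity, steps=200, lr=0.5):
    """Hypothetical sketch of Lagrangian relaxation for cost-minimizing routing.

    cost[i, j], perf[i, j]: predicted cost / performance of model j on query i.
    alpha: required mean predicted performance.
    capacity[j]: assumed maximum number of queries model j may receive.
    Returns (assignment, total_cost) for the cheapest feasible plan seen, or None.
    """
    cost = np.asarray(cost, dtype=float)
    perf = np.asarray(perf, dtype=float)
    capacity = np.asarray(capacity)
    n, m = cost.shape
    rows = np.arange(n)
    lam = 0.0          # multiplier for the performance constraint
    mu = np.zeros(m)   # multipliers for the per-model capacity constraints
    best = None
    for t in range(steps):
        # With the multipliers fixed, the relaxed problem decouples per query.
        reduced = cost - lam * perf + mu
        assign = reduced.argmin(axis=1)
        load = np.bincount(assign, minlength=m)
        mean_perf = perf[rows, assign].mean()
        total = cost[rows, assign].sum()
        feasible = mean_perf >= alpha and np.all(load <= capacity)
        if feasible and (best is None or total < best[1]):
            best = (assign.copy(), total)
        # Projected subgradient ascent on the dual variables (diminishing step).
        step = lr / np.sqrt(t + 1)
        lam = max(0.0, lam + step * (alpha - mean_perf))
        mu = np.maximum(0.0, mu + step * (load - capacity) / n)
    return best

# Toy instance (made-up numbers): model 0 is cheap and weak, model 1 is costly and strong.
cost = [[0.1, 0.4], [0.1, 0.4]]
perf = [[0.9, 0.95], [0.2, 0.9]]
result = route_dual(cost, perf, alpha=0.85, capacity=[2, 2])
```

In this toy instance the cheapest plan (both queries on model 0) misses the performance target, so the multiplier on the performance constraint grows until the hard query is moved to the stronger model. A subgradient scheme like this does not guarantee a feasible plan on every instance, which is why the sketch returns `None` when none is found.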

Paper Structure

This paper contains 14 sections, 34 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Comparison between traditional greedy routers and OmniRouter. Left: Greedy routers select models based on per-query optimization, leading to suboptimal allocations where Q1 (simple query) comes first and is assigned to the strong LLM, blocking Q2 (complex query) from accessing it. As a result, Q2 fails when assigned to the weak LLM. Right: OmniRouter employs constrained optimization to consider the global query distribution and model capabilities. It assigns the simple query to LLM1 (sufficient for the task) and reserves LLM2 for the complex query, thus maximizing overall success rate.
  • Figure 2: Illustration of OmniRouter, including the hybrid predictor and constrained optimizer.
  • Figure 3: Distribution of OmniRouter's query routing decisions across difficulty levels and model capabilities.
  • Figure 4: Impact of performance constraint ($\alpha$) on cost efficiency and routing accuracy. As performance requirements increase, greedy methods exhibit unbounded cost growth while OmniRouter's constrained optimization maintains controlled scaling.
  • Figure 5: Impact of concurrency constraint ($L$) on cost efficiency and routing accuracy. As available parallelism decreases, greedy methods struggle to make effective compromises, while OmniRouter's constrained optimization maintains balanced allocations.
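The scenario in Figure 1 can be reproduced with a small toy script. The success probabilities, the one-query-per-model capacity, and the function names `greedy` and `global_best` are assumptions chosen for illustration; the global search here is a brute-force enumeration that maximizes expected successes, a simplification that stands in for the paper's constrained optimizer rather than implementing it.

```python
from itertools import product

# Hypothetical success probabilities per query: (weak LLM1, strong LLM2).
perf = {"Q1_simple": (0.9, 0.95), "Q2_complex": (0.2, 0.9)}
capacity = [1, 1]  # assumed: each model can serve one query at a time

def greedy(perf, capacity):
    """Assign each query, in arrival order, to the best model that is still free.

    Assumes total capacity is at least the number of queries.
    """
    free = list(capacity)
    plan = {}
    for query, scores in perf.items():
        available = [j for j in range(len(free)) if free[j] > 0]
        j = max(available, key=lambda k: scores[k])
        free[j] -= 1
        plan[query] = j
    return plan

def global_best(perf, capacity):
    """Enumerate all capacity-respecting plans and keep the one with most expected successes."""
    queries = list(perf)
    best, best_score = None, -1.0
    for combo in product(range(len(capacity)), repeat=len(queries)):
        if any(combo.count(j) > capacity[j] for j in range(len(capacity))):
            continue  # violates a capacity limit
        score = sum(perf[q][j] for q, j in zip(queries, combo))
        if score > best_score:
            best, best_score = dict(zip(queries, combo)), score
    return best

greedy_plan = greedy(perf, capacity)       # Q1 takes the strong model first
global_plan = global_best(perf, capacity)  # the strong model is kept for Q2
```

With these made-up numbers, the greedy plan sends the simple query to the strong model and leaves the complex query on the weak one, while the global plan swaps them, mirroring the left and right panels of Figure 1.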

Theorems & Definitions (2)

  • proof
  • proof