OmniRouter: Budget and Performance Controllable Multi-LLM Routing
Kai Mei, Wujiang Xu, Minghao Guo, Shuhang Lin, Yongfeng Zhang
TL;DR
OmniRouter tackles the challenge of allocating queries across multiple LLMs under budget and performance constraints. It reframes routing as a constrained optimization problem and introduces a two-stage architecture: a retrieval-augmented predictor that estimates each model's capability and cost, and a Lagrangian dual-based optimizer that produces globally cost-efficient allocations. Empirical results show up to 6.30% improvement in response accuracy and at least 10.15% cost savings over competitive router baselines, with strong controllability under varying α and L. The work offers a practical mechanism for budget-aware, quality-guaranteed multi-LLM serving and provides theoretical insight through proofs of optimal assignment under the dual variables.
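One way to write down the constrained problem described above is sketched here. The notation is an assumption made for illustration, not the paper's: x_{ij} indicates that query i is sent to model j, c_{ij} and p_{ij} are the predicted cost and performance, α is read as the required average performance level, and L_j as a per-model capacity limit. The summary uses α and L without defining them, so this reading is a guess.

```latex
\begin{aligned}
\min_{x}\quad & \sum_{i=1}^{N}\sum_{j=1}^{M} c_{ij}\,x_{ij} \\
\text{s.t.}\quad & \sum_{i=1}^{N}\sum_{j=1}^{M} p_{ij}\,x_{ij} \;\ge\; \alpha N, \\
& \sum_{i=1}^{N} x_{ij} \;\le\; L_j \quad \text{for each model } j, \\
& \sum_{j=1}^{M} x_{ij} = 1,\quad x_{ij}\in\{0,1\} \quad \text{for each query } i.
\end{aligned}
```

Under this formulation, attaching a multiplier \lambda \ge 0 to the performance constraint and \mu_j \ge 0 to each capacity constraint makes the relaxed problem separate per query: for fixed multipliers, query i is sent to the model j that minimizes c_{ij} - \lambda p_{ij} + \mu_j.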
Abstract
Large language models (LLMs) deliver superior performance but require substantial computational resources and operate with relatively low efficiency, while smaller models can efficiently handle simpler tasks with fewer resources. LLM routing is a crucial paradigm that dynamically selects the most suitable large language models from a pool of candidates to process diverse inputs, ensuring optimal resource utilization while maintaining response quality. Existing routing frameworks typically model this as a locally optimal decision-making problem, selecting the presumed best-fit LLM for each query individually, which overlooks global budget constraints, resulting in ineffective resource allocation. To tackle this problem, we introduce OmniRouter, a fundamentally controllable routing framework for multi-LLM serving. Instead of making per-query greedy choices, OmniRouter models the routing task as a constrained optimization problem, assigning models that minimize total cost while ensuring the required performance level. Specifically, a hybrid retrieval-augmented predictor is designed to predict the capabilities and costs of LLMs. After obtaining the predicted cost and performance, we utilize a constrained optimizer for cost-optimal assignments that employs Lagrangian dual decomposition with adaptive multipliers. It iteratively converges toward the globally optimal query-model allocation, dynamically balancing latency minimization against quality thresholds while adhering to heterogeneous capacity constraints. Experiments show that OmniRouter achieves up to 6.30% improvement in response accuracy while simultaneously reducing computational costs by at least 10.15% compared to competitive router baselines. The code and the dataset are available at https://github.com/dongyuanjushi/OmniRouter.
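The second stage described above can be illustrated with a minimal sketch of Lagrangian dual routing solved by projected subgradient updates. This is not the authors' implementation. The function name `route_lagrangian`, its parameters, the step-size schedule, and the toy numbers are all hypothetical, and the predicted costs and performance scores are assumed to be given (in OmniRouter they would come from the retrieval-augmented predictor). Because the assignment is discrete, a subgradient scheme like this is a heuristic: it returns the cheapest feasible assignment it encounters, with no optimality guarantee.

```python
from typing import List, Optional, Tuple


def route_lagrangian(
    cost: List[List[float]],   # cost[i][j]: predicted cost of query i on model j
    perf: List[List[float]],   # perf[i][j]: predicted quality of query i on model j
    alpha: float,              # required average quality (assumed meaning)
    capacity: List[int],       # max queries per model (assumed meaning)
    steps: int = 200,
    lr: float = 1.0,
) -> Tuple[Optional[List[int]], float]:
    """Return the cheapest feasible assignment seen, or (None, inf)."""
    n, m = len(cost), len(cost[0])
    lam = 0.0            # multiplier for the quality constraint
    mu = [0.0] * m       # multipliers for the per-model capacity constraints
    best, best_cost = None, float("inf")

    for t in range(steps):
        # With the multipliers fixed, the relaxed problem separates per query.
        assign = [
            min(range(m), key=lambda j: cost[i][j] - lam * perf[i][j] + mu[j])
            for i in range(n)
        ]
        load = [0] * m
        for j in assign:
            load[j] += 1
        total_perf = sum(perf[i][j] for i, j in enumerate(assign))
        total_cost = sum(cost[i][j] for i, j in enumerate(assign))

        # Keep the cheapest assignment that satisfies every constraint.
        feasible = total_perf >= alpha * n and all(
            load[j] <= capacity[j] for j in range(m)
        )
        if feasible and total_cost < best_cost:
            best, best_cost = assign, total_cost

        # Projected subgradient ascent on the dual variables.
        step = lr / (1 + t) ** 0.5
        lam = max(0.0, lam + step * (alpha * n - total_perf))
        mu = [max(0.0, mu[j] + step * (load[j] - capacity[j])) for j in range(m)]

    return best, best_cost


# Toy data: model 0 is cheap but weak on the last two queries.
cost = [[1.0, 5.0]] * 4
perf = [[0.9, 0.95], [0.8, 0.9], [0.3, 0.9], [0.2, 0.9]]
assignment, total = route_lagrangian(cost, perf, alpha=0.8, capacity=[4, 4])
print(assignment, total)  # [0, 0, 1, 1] 12.0
```

In the toy run, a purely cost-greedy router would send every query to model 0 and miss the quality target. The dual variable for quality rises until the two queries that benefit most from the stronger model are moved to it, which is the kind of global trade-off the abstract contrasts with per-query greedy choices.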
