xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning
Cheng Qian, Zuxin Liu, Shirley Kokane, Akshara Prabhakar, Jielin Qiu, Haolin Chen, Zhiwei Liu, Heng Ji, Weiran Yao, Shelby Heinecke, Silvio Savarese, Caiming Xiong, Huan Wang
TL;DR
The paper tackles the cost-performance trade-off in multi-model LLM deployments by framing routing as a sequential decision problem learned via reinforcement learning. It introduces xRouter, a router that can either answer directly or call external models, and that is trained with a cost-aware reward, enabling principled, end-to-end optimization without hand-engineered routing rules. The authors provide a complete RL framework, data pipeline, cost accounting, and deployment/evaluation infrastructure. They report strong cost-performance trade-offs across diverse benchmarks and offer insights into model trainability and the limits of learned orchestration. Key findings are that end-to-end training improves routing quality, that moderate cost penalties yield near-optimal efficiency, and that adaptive routing can generalize across tasks and model pools. The authors also acknowledge that sophisticated multi-step orchestration remains difficult to achieve. The work advances practical, economically grounded LLM orchestration and supplies an open implementation to catalyze further research and real-world adoption.
Abstract
Modern LLM deployments confront a widening cost-performance spectrum: premium models deliver strong reasoning but are expensive, while lightweight models are economical yet brittle on complex tasks. Static escalation rules and keyword heuristics under-utilize this spectrum and fail to adapt across task types. We present xRouter, a tool-calling-based routing system in which a learned router can either answer directly or invoke one or more external models. The router is trained end-to-end with reinforcement learning using an explicit, cost-aware reward that encodes cost-performance trade-offs, eliminating the need for hand-engineered routing rules. Our implementation encompasses the full reinforcement learning framework, including reward and cost accounting, as well as the deployment and evaluation pipelines. Across diverse benchmarks, xRouter achieves strong cost-performance trade-offs (e.g., substantial cost reductions at comparable task completion rates), and provides empirical insights into what reliably helps learned routing and what does not, ranging from model trainability to the difficulty of eliciting sophisticated orchestration behaviors in small open models. We hope these findings and our open implementation will serve as a practical substrate for advancing learned, cost-aware LLM orchestration.
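To make the idea of an explicit, cost-aware reward concrete, the following is a minimal sketch of one possible shape for such a reward. It is not taken from the paper: the success-gated form, the names `ModelPrice`, `call_cost`, and `cost_aware_reward`, the parameters `bonus` and `cost_penalty`, and all prices are illustrative assumptions. The sketch assumes that a failed episode earns nothing and that a successful episode earns a fixed bonus minus a penalty proportional to the money spent on external model calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Hypothetical per-million-token prices in USD (not real vendor prices)."""
    input_per_mtok: float
    output_per_mtok: float


def call_cost(price: ModelPrice, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single external model call under the assumed pricing."""
    total = input_tokens * price.input_per_mtok + output_tokens * price.output_per_mtok
    return total / 1_000_000


def cost_aware_reward(success: bool, total_cost: float,
                      bonus: float = 1.0, cost_penalty: float = 0.5) -> float:
    """Assumed success-gated reward: zero on failure, otherwise the bonus
    minus a penalty that grows linearly with the episode's total cost."""
    if not success:
        return 0.0
    return bonus - cost_penalty * total_cost


# Example episode: the router delegates once to a hypothetical premium model.
premium = ModelPrice(input_per_mtok=1.0, output_per_mtok=2.0)
spent = call_cost(premium, input_tokens=1000, output_tokens=500)
delegated_reward = cost_aware_reward(success=True, total_cost=spent)

# Answering directly incurs no external cost, so a correct direct answer
# earns the full bonus under this sketch.
direct_reward = cost_aware_reward(success=True, total_cost=0.0)
```

Under these assumptions, a correct direct answer always scores at least as high as a correct delegated one, and an incorrect answer scores zero regardless of cost. The `cost_penalty` weight controls how strongly the router is pushed toward cheaper options.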
