Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning

Haozhen Zhang, Tao Feng, Jiaxuan You

TL;DR

Router-R1 introduces a reinforcement learning–based framework for coordinating multiple LLMs through multi-round routing, where the router itself is an LLM that interleaves internal thinking with targeted model calls. A lightweight, rule-based reward combining format, final outcome, and cost components guides the policy to balance accuracy and expenditure, while descriptor-based generalization enables adaptation to unseen LLMs without retraining. Evaluated on seven diverse QA benchmarks, Router-R1 achieves state-of-the-art performance and demonstrates robust generalization and cost management across varying model pools. The work highlights the feasibility and benefits of RL-driven, multi-round LLM coordination for complex tasks that require model collaboration.

Abstract

The rapid emergence of diverse large language models (LLMs) has spurred the development of LLM routers that assign user queries to the most suitable model. However, existing LLM routers typically perform a single-round, one-to-one mapping (i.e., assigning each query to a single model in isolation), which limits their capability to tackle complex tasks that demand the complementary strengths of multiple LLMs. In this paper, we present Router-R1, a reinforcement learning (RL)-based framework that formulates multi-LLM routing and aggregation as a sequential decision process. Router-R1 instantiates the router itself as a capable LLM, leveraging its reasoning ability to interleave "think" actions (internal deliberation) with "route" actions (dynamic model invocation), and integrates each response into its evolving context. To facilitate learning, we employ a lightweight rule-based reward comprising format rewards, final outcome rewards, and a novel cost reward for optimizing the balance between performance and cost, opening a pathway toward enhancing performance-cost trade-offs via RL. Router-R1 also conditions only on simple model descriptors such as pricing, latency, and example performance, enabling strong generalization to unseen model selection. Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost management.
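As a rough illustration of how a rule-based reward with format, outcome, and cost components might be combined, consider the sketch below. It is not the paper's exact formulation: the `-1.0` format penalty, the weighting parameter `alpha`, the `max_cost` normalization, and the function names `exact_match` and `combined_reward` are all assumptions made for this example.

```python
def exact_match(prediction: str, gold: str) -> float:
    """Outcome reward: 1.0 if the normalized strings match, else 0.0."""
    return 1.0 if prediction.strip().lower() == gold.strip().lower() else 0.0


def combined_reward(
    format_ok: bool,
    prediction: str,
    gold: str,
    cost: float,
    max_cost: float,
    alpha: float = 0.5,
) -> float:
    """Hypothetical combination of format, outcome, and cost rewards.

    Assumptions (not taken from the paper):
      - a malformed trajectory receives a flat penalty of -1.0;
      - the cost reward is 1 minus the cost normalized by max_cost;
      - alpha in [0, 1] trades outcome against cost.
    """
    if not format_ok:
        return -1.0
    outcome = exact_match(prediction, gold)
    # Cheaper trajectories earn a cost reward closer to 1.0.
    cost_reward = 1.0 - min(cost / max_cost, 1.0)
    return (1.0 - alpha) * outcome + alpha * cost_reward


# A correct answer that used half the cost budget, weighted equally.
reward = combined_reward(True, "Paris", " paris ", cost=0.5, max_cost=1.0, alpha=0.5)
```

Under these assumptions, raising `alpha` pushes the policy toward cheaper models, while `alpha = 0` reduces the signal to format plus final outcome only.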

Paper Structure

This paper contains 45 sections, 8 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Router-R1 architecture. (a) Single-round Routing: A conventional router assigns each query to a single LLM in isolation via a one-shot decision, without internal reasoning or multi-model coordination. (b) Multi-round Routing (ours): Router-R1 casts multi-LLM routing as a sequential decision process, which leverages an LLM-based router to interleave internal reasoning with external LLM routing and integrates retrieved information into its evolving context. This enables adaptive multi-model coordination for complex tasks, surpassing single-round routing with better performance.
  • Figure 2: Training prompt template for Router-R1 (some texts are omitted for page space).
  • Figure 3: Analysis of cost rewards on the NQ, PopQA, HotpotQA (HpQA), and 2WikiMultiHopQA (2wiki) datasets.
  • Figure 4: Analysis of LLM API call count and Router-R1 training convergence.
  • Figure 5: Cost vs. Performance Pareto Curve on NQ, PopQA, HpQA, and 2wiki datasets w.r.t. Exact Match and raw cost rewards (unnormalized).
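The multi-round routing loop described in the Figure 1(b) caption can be sketched as a simple control loop. This is a minimal toy, not the authors' implementation: the `(kind, target, text)` action tuple, the `run_router` and `toy_policy` functions, the `max_rounds` cap, and the stub model named `small-model` are all hypothetical. In Router-R1 the policy is an RL-trained LLM; here it is a scripted stand-in.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical action format: (kind, target_model, text).
Action = Tuple[str, str, str]


def run_router(
    question: str,
    policy: Callable[[List[str]], Action],
    models: Dict[str, Callable[[str], str]],
    max_rounds: int = 4,
) -> Tuple[str, List[str]]:
    """Interleave 'think' and 'route' actions until the policy answers."""
    context = [f"question: {question}"]
    for _ in range(max_rounds):
        kind, target, text = policy(context)
        if kind == "think":
            # Internal deliberation is appended to the evolving context.
            context.append(f"think: {text}")
        elif kind == "route":
            # Call the chosen model and integrate its response.
            response = models[target](text)
            context.append(f"info[{target}]: {response}")
        elif kind == "answer":
            return text, context
        else:
            raise ValueError(f"unknown action kind: {kind}")
    return "", context  # Round budget exhausted without an answer.


def toy_policy(context: List[str]) -> Action:
    """Scripted stand-in for the LLM router: think, route once, then answer."""
    if len(context) == 1:
        return ("think", "", "I need the capital of France.")
    if len(context) == 2:
        return ("route", "small-model", "What is the capital of France?")
    return ("answer", "", context[-1].split(": ", 1)[1])


models = {"small-model": lambda query: "Paris"}
answer, trace = run_router("What is the capital of France?", toy_policy, models)
```

Because every model response is appended to `context`, later actions can condition on earlier ones, which is the property that separates this loop from a one-shot, single-round router.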