Route-and-Reason: Scaling Large Language Model Reasoning with Reinforced Model Router

Chenyang Shao, Xinyang Liu, Yutang Lin, Fengli Xu, Yong Li

TL;DR

R2-Reasoner tackles the high cost of chain-of-thought reasoning by introducing a reinforced model router that handles subtask decomposition and model allocation across heterogeneous LLMs. The approach uses a staged training pipeline that combines supervised fine-tuning with reinforcement learning based on Group Relative Policy Optimization to coordinate a decomposer and an allocator. Experiments across six benchmarks show an 84.46% reduction in API cost relative to state-of-the-art baselines while preserving competitive accuracy and generalizing to unseen models. The work provides a practical pathway to scalable, budget-aware reasoning in real-world deployments.

Abstract

Chain-of-thought has proven essential for enhancing the complex reasoning abilities of Large Language Models (LLMs), but it also incurs high computational costs. Recent work has explored routing queries among multiple models and shown it to be a promising approach. However, previous work operates directly at the task level, i.e., assigning whole user queries to suitable LLMs, which does not allow hybrid LLMs to truly collaborate on finer-grained subtasks. Collaboration at the level of intermediate reasoning steps (thoughts) could enable more efficient coordination, but it also poses significant challenges for router scheduling, placing heavy demands on the quality of task decomposition and the precision of the router. To address this, we propose R2-Reasoner, a novel framework centered around a Reinforced Model Router designed to efficiently scale LLM reasoning. This router orchestrates collaboration across nine heterogeneous models, whose parameter scales range from less than 1B to hundreds of billions, by first breaking down a complex query into subtasks with a decomposer and then assigning each subtask to the optimal model with a subtask allocator, balancing performance with cost. Training this router involves a two-stage alternating process for the decomposer and the allocator, integrating supervised fine-tuning with reinforcement learning to enable effective self-supervised refinement. Extensive experiments across six challenging reasoning benchmarks demonstrate that R2-Reasoner reduces API costs by 84.46% compared with state-of-the-art baselines while maintaining competitive reasoning accuracy. Our framework paves the way for the development of more scalable and efficient reasoning systems. Our code is open-source at https://anonymous.4open.science/r/R2_Reasoner.
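The decompose-then-allocate flow described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: in the paper both the decomposer and the allocator are trained LLM components, whereas here they are replaced by toy stand-ins. The names `Model`, `decompose`, `estimate_difficulty`, `allocate`, and `route`, the semicolon-based splitting, the word-count difficulty proxy, and the capability and cost numbers are all hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    """A candidate LLM in the pool (all values are illustrative assumptions)."""
    name: str
    capability: float     # hypothetical skill score in [0, 1]
    cost_per_call: float  # hypothetical price for solving one subtask


def decompose(query: str) -> list[str]:
    """Stand-in decomposer: split the query on ';'.

    The paper uses a trained LLM for this step; this is only a placeholder.
    """
    return [part.strip() for part in query.split(";") if part.strip()]


def estimate_difficulty(subtask: str) -> float:
    """Toy difficulty proxy (assumption): longer subtasks count as harder."""
    return min(1.0, len(subtask.split()) / 10)


def allocate(subtask: str, pool: list[Model]) -> Model:
    """Pick the cheapest model whose capability covers the estimated difficulty."""
    difficulty = estimate_difficulty(subtask)
    capable = [m for m in pool if m.capability >= difficulty]
    if not capable:
        # Nothing is strong enough: fall back to the most capable model.
        return max(pool, key=lambda m: m.capability)
    return min(capable, key=lambda m: m.cost_per_call)


def route(query: str, pool: list[Model]) -> list[tuple[str, Model]]:
    """Decompose a query and assign each subtask to a model."""
    return [(subtask, allocate(subtask, pool)) for subtask in decompose(query)]


def total_cost(plan: list[tuple[str, Model]]) -> float:
    """Sum the per-subtask cost of a routing plan."""
    return sum(model.cost_per_call for _, model in plan)
```

With a pool of a cheap small model and an expensive large one, an easy subtask is sent to the small model and only the hard subtask pays the large-model price, which is the source of the cost savings the abstract refers to. A learned allocator would replace the fixed threshold rule with a policy trained on accuracy and cost signals.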

Paper Structure

This paper contains 39 sections, 12 equations, 4 figures, 9 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of Our R2-Reasoner Framework
  • Figure 2: Overview of Our Grouped Search Strategy for Optimal Allocation Scheme
  • Figure 3: Acc-Cost trade-off curves on MATH (left) and SCAN (right). A magnified inset is provided to the right of the original sub-figure to more precisely illustrate the Pareto frontier of our method.
  • Figure 4: Inference latency comparison of different methods across three benchmarks.