R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing

Tianyu Fu, Yi Ge, Yichen You, Enshu Liu, Zhihang Yuan, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang

TL;DR

R2R tackles the high inference cost of large language models by routing only path-divergent tokens to a large model while the lightweight SLM handles the majority of generation. It introduces a data-labeling pipeline and a 56M-parameter neural router trained on millions of token-level labels to predict divergence, enabling real-time token-level routing with minimal overhead. Empirical results on math, coding, and QA benchmarks show that R2R improves the accuracy-efficiency Pareto frontier, achieving substantial speedups and memory savings with limited LLM usage. The approach generalizes across model families and remains compatible with mixture-of-experts and other efficiency techniques, offering a practical path to scalable, high-quality mixed inference.

Abstract

Large Language Models (LLMs) achieve impressive reasoning capabilities at the cost of substantial inference overhead, which makes them difficult to deploy. Although distilled Small Language Models (SLMs) significantly enhance efficiency, their performance suffers as they fail to follow LLMs' reasoning paths. We find that only a small fraction of tokens genuinely cause the reasoning paths of LLMs and SLMs to diverge. Most generated tokens are either identical or exhibit neutral differences, such as minor variations in abbreviations or expressions. Leveraging this insight, we introduce **Roads to Rome (R2R)**, a neural token routing method that selectively utilizes LLMs only for these critical, path-divergent tokens, while leaving the majority of token generation to the SLM. We also develop an automatic data generation pipeline that identifies divergent tokens and generates token-level routing labels to train the lightweight router. We apply R2R to combine R1-1.5B and R1-32B models from the DeepSeek family, and evaluate on challenging math, coding, and QA benchmarks. With an average activated parameter size of 5.6B, R2R surpasses the average accuracy of R1-7B by 1.6x, outperforming even the R1-14B model. Compared to R1-32B, it delivers a 2.8x wall-clock speedup with comparable performance, advancing the Pareto frontier of test-time scaling efficiency. Our code is available at https://github.com/thu-nics/R2R.
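
The routing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `r2r_generate`, the callables `slm_next`, `llm_next`, and `router_score`, the `threshold` parameter, and the toy token paths are all hypothetical names chosen for illustration. The sketch assumes the router returns a score for each SLM-proposed token and that a score at or above the threshold means the token is replaced by the LLM's prediction.

```python
from typing import Callable, List, Sequence, Tuple

Token = str


def r2r_generate(
    prompt: Sequence[Token],
    slm_next: Callable[[Sequence[Token]], Token],
    llm_next: Callable[[Sequence[Token]], Token],
    router_score: Callable[[Sequence[Token], Token], float],
    threshold: float = 0.5,  # hypothetical routing threshold
    max_new_tokens: int = 32,
    eos: Token = "<eos>",
) -> Tuple[List[Token], int]:
    """Generate with the SLM, calling the LLM only for tokens the router flags."""
    context = list(prompt)
    llm_calls = 0
    for _ in range(max_new_tokens):
        token = slm_next(context)  # the SLM proposes every token
        if router_score(context, token) >= threshold:
            token = llm_next(context)  # the LLM replaces the flagged token
            llm_calls += 1
        context.append(token)  # generation continues from the kept token
        if token == eos:
            break
    return context[len(prompt):], llm_calls


# Toy stand-ins (assumptions): the two "models" differ at a single position.
PROMPT = ["Q:"]
LLM_PATH = ["2", "+", "2", "=", "4", "<eos>"]
SLM_PATH = ["2", "+", "2", "=", "5", "<eos>"]


def toy_llm(context: Sequence[Token]) -> Token:
    return LLM_PATH[len(context) - len(PROMPT)]


def toy_slm(context: Sequence[Token]) -> Token:
    return SLM_PATH[len(context) - len(PROMPT)]


def toy_router(context: Sequence[Token], token: Token) -> float:
    # Oracle-like toy router: flags only the known divergent token.
    return 1.0 if token == "5" else 0.0


tokens, llm_calls = r2r_generate(PROMPT, toy_slm, toy_llm, toy_router)
print(tokens, llm_calls)
```

In this toy run the LLM is invoked once, for the single token where the two paths diverge, and the SLM produces the rest. A real router would score SLM outputs with a trained network rather than a lookup.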

Paper Structure

This paper contains 61 sections, 16 equations, 13 figures, 18 tables, 3 algorithms.

Figures (13)

  • Figure 1: (a) Examples of the R2R routing objective. Given a partial response as context, if the SLM's next-token prediction differs from the LLM's, it is further categorized as neutral or divergent based on its effect on the reasoning path. (b) Distribution of identical, neutral, and divergent labels in the R2R training set with 7.6M token labels.
  • Figure 2: (a) R2R uses a neural router to inspect SLM outputs at each step, immediately corrects divergent tokens with the LLM, then continues generation from the corrected outputs. (b) Speculative decoding uses the LLM to periodically verify whether SLM outputs are identical to LLM predictions, invalidating all tokens after the first correction within the period.
  • Figure 3: R2R data labeling pipeline. Given a query question, the LLM first generates a response to establish the desired reasoning path. The SLM then prefills this path to identify identical and different next-token predictions. For each different SLM token, the LLM continues generation from that point. Finally, a verifier model determines whether each difference leads to a neutral or divergent outcome, labeling the model preference as SLM or LLM, respectively.
  • Figure 3: Comparison of latency, output token length, and average speed across methods. Subscripts denote the standard deviations across AIME.
  • Figure 4: Oracle insights for router design. (a) SLM entropy distribution, clipped at the 99th percentile for visualization clarity. (b) Divergence rate and frequency of different tokens.
  • ...and 8 more figures
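
The data labeling pipeline summarized in the Figure 3 caption can be sketched roughly as follows. This is a simplified illustration, not the released code: `label_routing`, `slm_predict`, `llm_continue`, and `verifier_is_neutral` are hypothetical names, the toy verifier simply compares final tokens, and the sketch assumes the verifier compares the LLM's continuation after the SLM token against the remainder of the original LLM response.

```python
from typing import Callable, List, Sequence, Tuple

Token = str


def label_routing(
    query: Sequence[Token],
    llm_response: Sequence[Token],
    slm_predict: Callable[[Sequence[Token]], Token],
    llm_continue: Callable[[Sequence[Token]], List[Token]],
    verifier_is_neutral: Callable[[Sequence[Token], Sequence[Token]], bool],
) -> List[Tuple[str, str]]:
    """Return one (category, preferred model) pair per token of the LLM response."""
    labels: List[Tuple[str, str]] = []
    for i, llm_token in enumerate(llm_response):
        # The SLM predicts the next token given the LLM's own path as prefix.
        prefix = list(query) + list(llm_response[:i])
        slm_token = slm_predict(prefix)
        if slm_token == llm_token:
            labels.append(("identical", "SLM"))
            continue
        # The LLM continues from the differing SLM token.
        slm_path = [slm_token] + llm_continue(prefix + [slm_token])
        llm_path = list(llm_response[i:])
        # The verifier decides whether the difference changes the outcome.
        if verifier_is_neutral(slm_path, llm_path):
            labels.append(("neutral", "SLM"))
        else:
            labels.append(("divergent", "LLM"))
    return labels


# Toy stand-ins (assumptions) for the three models.
QUERY = ["2+2?"]
LLM_RESPONSE = ["the", "answer", "is", "4"]
SLM_GUESSES = ["the", "result", "is", "5"]


def toy_slm_predict(prefix: Sequence[Token]) -> Token:
    return SLM_GUESSES[len(prefix) - len(QUERY)]


def toy_llm_continue(prefix: Sequence[Token]) -> List[Token]:
    if prefix[-1] == "5":
        return []  # the wrong answer has already been committed
    return LLM_RESPONSE[len(prefix) - len(QUERY):]


def toy_verifier(slm_path: Sequence[Token], llm_path: Sequence[Token]) -> bool:
    # Toy criterion: neutral if both paths end in the same final token.
    return slm_path[-1] == llm_path[-1]


labels = label_routing(QUERY, LLM_RESPONSE, toy_slm_predict, toy_llm_continue, toy_verifier)
print(labels)
```

Here "result" in place of "answer" is labeled neutral because the continuation still reaches the same final token, while "5" in place of "4" is labeled divergent and assigned to the LLM.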