R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning

Zhuokun Chen, Zeren Chen, Jiahao He, Lu Sheng, Mingkui Tan, Jianfei Cai, Bohan Zhuang

TL;DR

R-Stitch tackles the high latency of chain-of-thought reasoning in LLMs by introducing a training-free, entropy-guided hybrid decoding framework that partitions token-level work between a small language model (SLM) and a large language model (LLM). The method leverages per-token entropy, defined as $\mathcal{H}_t = \frac{-\sum_i p_{t,i}\log p_{t,i}}{\log V}$, to route low-entropy tokens to the SLM and high-entropy tokens to the LLM, with independent KV caches to minimize switching overhead. An extension, R-Stitch$^{+}$, adds a latency-aware RL router that adaptively balances efficiency and accuracy beyond fixed thresholds, using a reward $R = r_{\text{acc}} + r_{\text{eff}}$ where $r_{\text{eff}} = - \lambda \cdot r_{\text{acc}} \cdot \widehat{L}$. Empirical results on multiple math benchmarks show substantial speedups (up to 4.10$\times$ for 32B models) while preserving accuracy near full LLM decoding, and the approach remains adaptable to different budgets without retraining. The work demonstrates that training-free, entropy-driven collaboration between heterogeneous models provides a practical path to scalable CoT reasoning in real-world deployments.
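
To make the entropy definition concrete, here is a minimal Python sketch of the normalized per-token entropy $\mathcal{H}_t$ and a threshold-based routing decision. The function names, the threshold value `tau = 0.5`, and the toy four-token distributions are illustrative assumptions, not values taken from the paper.

```python
import math

def normalized_entropy(probs):
    """Entropy of a next-token distribution divided by log V, so it lies in [0, 1]."""
    vocab_size = len(probs)
    if vocab_size < 2:
        return 0.0  # a single-token vocabulary carries no uncertainty
    raw = -sum(p * math.log(p) for p in probs if p > 0.0)
    return raw / math.log(vocab_size)

def route(probs, tau=0.5):
    """Hypothetical router: low entropy goes to the SLM, high entropy to the LLM."""
    return "SLM" if normalized_entropy(probs) <= tau else "LLM"

# Toy distributions over a 4-token vocabulary (assumed for illustration).
peaked = [0.97, 0.01, 0.01, 0.01]   # confident prediction
flat = [0.25, 0.25, 0.25, 0.25]     # maximally uncertain prediction

print(route(peaked), route(flat))
```

Dividing by $\log V$ makes the score comparable across vocabularies of different sizes, which is why a single threshold can be applied to both models in this sketch.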

Abstract

Chain-of-thought (CoT) enhances the problem-solving ability of large language models (LLMs) but incurs substantial inference cost due to long autoregressive trajectories. Existing acceleration strategies either shorten traces via early stopping or compression, or adopt speculative decoding with a smaller model. However, speculative decoding provides limited gains when model agreement is low and rigidly enforces token-level consistency, overlooking the observation that some smaller models, when correct, produce significantly more concise reasoning traces that could reduce inference length. We introduce R-Stitch, a training-free hybrid decoding framework that leverages token-level entropy as an uncertainty proxy to delegate computation between a small language model (SLM) and an LLM. Our analysis shows that high-entropy tokens are more likely to induce errors, motivating an entropy-guided routing strategy that lets the SLM efficiently handle low-entropy tokens while delegating uncertain ones to the LLM, thereby avoiding full rollbacks and preserving answer quality. We further extend this design with R-Stitch$^{+}$, which learns an adaptive routing policy to adjust the token budget dynamically beyond fixed thresholds. By jointly reducing per-token decoding complexity and the number of generated tokens, our method achieves substantial acceleration with negligible accuracy loss. Concretely, it attains peak speedups of 3.00$\times$ on DeepSeek-R1-Distill-Qwen-7B, 3.85$\times$ on 14B, and 4.10$\times$ on QWQ-32B while maintaining accuracy comparable to full LLM decoding. Moreover, it naturally enables adaptive efficiency--accuracy trade-offs that can be tailored to diverse computational budgets without retraining.
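
The reward used to train the R-Stitch$^{+}$ router, $R = r_{\text{acc}} + r_{\text{eff}}$ with $r_{\text{eff}} = -\lambda \cdot r_{\text{acc}} \cdot \widehat{L}$, can be sketched as follows. This is a hedged illustration: it assumes $r_{\text{acc}}$ is a binary correctness signal and that $\widehat{L}$ is latency normalized by a budget and clipped to $[0, 1]$; neither detail is stated in this summary, and `router_reward`, `latency_budget`, and `lam` are hypothetical names.

```python
def router_reward(correct, latency, latency_budget, lam=0.5):
    """Sketch of R = r_acc + r_eff with r_eff = -lam * r_acc * L_hat."""
    r_acc = 1.0 if correct else 0.0              # assumption: binary accuracy reward
    l_hat = min(latency / latency_budget, 1.0)   # assumption: normalized, clipped latency
    r_eff = -lam * r_acc * l_hat                 # penalty applies only to correct answers
    return r_acc + r_eff

print(router_reward(True, 2.0, 4.0))    # correct answer at half the budget
print(router_reward(False, 2.0, 4.0))   # incorrect answer
```

Because $r_{\text{eff}}$ is multiplied by $r_{\text{acc}}$, an incorrect trajectory receives no efficiency term at all under this form, so the router cannot gain reward by being fast but wrong.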

Paper Structure

This paper contains 26 sections, 9 equations, 13 figures, 19 tables, 1 algorithm.

Figures (13)

  • Figure 1: Token-level consistency and speedup analysis. (a) shows the relationship between token-level consistency and speedup in speculative decoding across different LLM-SLM pairs on AMC. (b) presents the distribution of speedup ratios across individual samples from AMC. (c) illustrates the token counts for questions correctly answered by both the SLM and LLM.
  • Figure 2: Overview of R-Stitch. Given a question with CoT prompting, decoding alternates between an SLM and an LLM under an entropy-based switching policy. Generation starts with the SLM; tokens with low entropy are accepted directly, while high-entropy tokens trigger the LLM to overwrite them and resume decoding. Symmetrically, when the LLM outputs a low-entropy token, it returns to the SLM to reduce computational cost. This bidirectional mechanism adaptively allocates computation, preserving SLM efficiency while leveraging LLM reliability when needed.
  • Figure 3: Sample-level entropy in correct vs. incorrect solutions.
  • Figure 4: Token-level entropy distribution across full reasoning traces.
  • Figure 5: Elevated entropy around the first harmful token.
  • ...and 8 more figures
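
The bidirectional switching policy described in the Figure 2 caption can be sketched as a small decoding loop. This is a toy illustration under stated assumptions, not the authors' implementation: `slm_step` and `llm_step` are hypothetical callables that return a token and its normalized entropy, the scripted "models" below are stand-ins, and the independent KV caches are omitted.

```python
def stitch_decode(slm_step, llm_step, prompt, tau, max_tokens):
    """Alternate between SLM and LLM using an entropy threshold tau."""
    tokens = list(prompt)
    trace = []          # which model produced each generated token
    active = "SLM"      # generation starts with the SLM
    while len(trace) < max_tokens:
        if active == "SLM":
            token, entropy = slm_step(tokens)
            source = "SLM"
            if entropy > tau:
                # High-entropy SLM token: the LLM overwrites it.
                token, entropy = llm_step(tokens)
                source = "LLM"
                # The LLM keeps decoding only while it is itself uncertain.
                active = "LLM" if entropy > tau else "SLM"
        else:
            token, entropy = llm_step(tokens)
            source = "LLM"
            if entropy <= tau:
                active = "SLM"  # low-entropy LLM token: hand back to the SLM
        tokens.append(token)
        trace.append(source)
    return tokens, trace

# Scripted stand-ins: entropy is high only at chosen context lengths (assumed).
def make_step(prefix, hard_positions):
    def step(context):
        position = len(context)
        entropy = 0.9 if position in hard_positions else 0.1
        return f"{prefix}{position}", entropy
    return step

slm_step = make_step("s", {2, 3})
llm_step = make_step("L", {2})

tokens, trace = stitch_decode(slm_step, llm_step, ["q"], tau=0.5, max_tokens=4)
print(tokens)
print(trace)
```

In this toy run the SLM is uncertain at position 2, so the LLM overwrites that token and continues until it emits a low-entropy token, after which control returns to the SLM.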