R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning
Zhuokun Chen, Zeren Chen, Jiahao He, Lu Sheng, Mingkui Tan, Jianfei Cai, Bohan Zhuang
TL;DR
R-Stitch tackles the high latency of chain-of-thought reasoning in LLMs with a training-free, entropy-guided hybrid decoding framework that splits token-level work between a small language model (SLM) and a large language model (LLM). The method uses normalized per-token entropy, defined as $\mathcal{H}_t = \frac{-\sum_i p_{t,i}\log p_{t,i}}{\log V}$ where $V$ is the vocabulary size, to route low-entropy tokens to the SLM and high-entropy tokens to the LLM, keeping independent KV caches for the two models to minimize switching overhead. An extension, R-Stitch$^{+}$, adds a latency-aware RL router that adaptively balances efficiency and accuracy beyond fixed thresholds, using a reward $R = r_{\text{acc}} + r_{\text{eff}}$ where $r_{\text{eff}} = -\lambda \cdot r_{\text{acc}} \cdot \widehat{L}$. Empirical results on multiple math benchmarks show substantial speedups (up to 4.10$\times$ on QWQ-32B) while keeping accuracy close to full LLM decoding, and the approach adapts to different compute budgets without retraining. The work demonstrates that training-free, entropy-driven collaboration between heterogeneous models provides a practical path to scalable CoT reasoning in real-world deployments.
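The two formulas above can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the threshold `tau`, the weight `lam`, the binary form of $r_{\text{acc}}$, and the reading of $\widehat{L}$ as a latency value normalized to $[0, 1]$ are all hypothetical choices made here for the example.

```python
import math


def normalized_entropy(probs):
    """H_t = -sum_i p_i log p_i / log V, with V = len(probs)."""
    v = len(probs)
    if v < 2:
        return 0.0  # a single-outcome distribution carries no uncertainty
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(v)


def route(probs, tau=0.5):
    """Pick the model for one token; tau is a hypothetical threshold."""
    return "SLM" if normalized_entropy(probs) <= tau else "LLM"


def reward(correct, norm_latency, lam=0.1):
    """R = r_acc + r_eff with r_eff = -lam * r_acc * L_hat.

    Assumptions: r_acc is 1.0 for a correct answer and 0.0 otherwise,
    and norm_latency stands in for L_hat as a value in [0, 1].
    """
    r_acc = 1.0 if correct else 0.0
    r_eff = -lam * r_acc * norm_latency
    return r_acc + r_eff
```

Because $r_{\text{eff}}$ is multiplied by $r_{\text{acc}}$, the latency penalty only applies when $r_{\text{acc}}$ is nonzero, so under the binary assumption above an incorrect answer earns zero reward regardless of how fast it was produced.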
Abstract
Chain-of-thought (CoT) enhances the problem-solving ability of large language models (LLMs) but incurs substantial inference cost due to long autoregressive trajectories. Existing acceleration strategies either shorten traces via early stopping or compression, or adopt speculative decoding with a smaller model. However, speculative decoding provides limited gains when model agreement is low and rigidly enforces token-level consistency, overlooking the observation that some smaller models, when correct, produce significantly more concise reasoning traces that could reduce inference length. We introduce R-Stitch, a training-free hybrid decoding framework that leverages token-level entropy as an uncertainty proxy to delegate computation between a small language model (SLM) and an LLM. Our analysis shows that high-entropy tokens are more likely to induce errors, motivating an entropy-guided routing strategy that lets the SLM efficiently handle low-entropy tokens while delegating uncertain ones to the LLM, thereby avoiding full rollbacks and preserving answer quality. We further extend this design with R-Stitch$^{+}$, which learns an adaptive routing policy to adjust the token budget dynamically beyond fixed thresholds. By jointly reducing per-token decoding complexity and the number of generated tokens, our method achieves substantial acceleration with negligible accuracy loss. Concretely, it attains peak speedups of 3.00$\times$ on DeepSeek-R1-Distill-Qwen-7B, 3.85$\times$ on 14B, and 4.10$\times$ on QWQ-32B while maintaining accuracy comparable to full LLM decoding. Moreover, it naturally enables adaptive efficiency-accuracy trade-offs that can be tailored to diverse computational budgets without retraining.
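The entropy-guided delegation described in the abstract can be sketched as a toy decoding loop. This is a deliberately simplified illustration, not the paper's exact switching rule: the names `stitch_decode`, `slm_step`, and `llm_step`, the threshold `tau`, and the choice to consult the SLM first at every step are all assumptions made for the example, and KV-cache handling is not modeled.

```python
import math


def normalized_entropy(probs):
    """Entropy of a next-token distribution, scaled to [0, 1] by log V."""
    v = len(probs)
    if v < 2:
        return 0.0
    return -sum(p * math.log(p) for p in probs if p > 0.0) / math.log(v)


def stitch_decode(slm_step, llm_step, prompt, tau=0.5, max_tokens=16, eos="<eos>"):
    """Generate tokens, falling back to the LLM when the SLM is uncertain.

    slm_step and llm_step are hypothetical callables that map the current
    token list to (next_token, next_token_probabilities).
    """
    tokens = list(prompt)
    sources = []  # which model produced each generated token
    for _ in range(max_tokens):
        token, probs = slm_step(tokens)
        source = "SLM"
        if normalized_entropy(probs) > tau:
            # Uncertain token: discard the SLM proposal and ask the LLM
            # for this single position only, so no earlier token is rolled back.
            token, _ = llm_step(tokens)
            source = "LLM"
        tokens.append(token)
        sources.append(source)
        if token == eos:
            break
    return tokens[len(prompt):], sources
```

In this sketch a smaller `tau` sends more tokens to the LLM and a larger `tau` keeps more on the SLM, which is one simple way to picture the efficiency-accuracy trade-off that a fixed threshold exposes.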
