Retrieval-of-Thought: Efficient Reasoning via Reusing Thoughts

Ammar Ahmed, Azal Ahmad Khan, Ayaan Ahmad, Sheng Di, Zirui Liu, Ali Anwar

Abstract

Large reasoning models improve accuracy by producing long reasoning traces, but this inflates latency and cost, motivating inference-time efficiency. We propose Retrieval-of-Thought (RoT), which reuses prior reasoning as composable "thought" steps to guide new problems. RoT organizes steps into a thought graph with sequential and semantic edges to enable fast retrieval and flexible recombination. At inference, RoT retrieves query-relevant nodes and applies reward-guided traversal to assemble a problem-specific template that guides generation. This dynamic template reuse reduces redundant exploration and, therefore, reduces output tokens while preserving accuracy. We evaluate RoT on reasoning benchmarks with multiple models, measuring accuracy, token usage, latency, and memory overhead. Findings show small prompt growth but substantial efficiency gains, with RoT reducing output tokens by up to 40%, inference latency by 82%, and cost by 59% while maintaining accuracy. RoT establishes a scalable paradigm for efficient LRM reasoning via dynamic template construction through retrieval.
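
The pipeline described in the abstract (a thought graph with sequential and semantic edges, query-based retrieval, then reward-guided traversal) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the `ThoughtGraph` class, the bag-of-words `embed` function, the `semantic_threshold` parameter, and the use of query similarity as the reward are all hypothetical stand-ins, and the traversal is a simple greedy walk.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: a real system would use a sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ThoughtGraph:
    """Hypothetical thought graph: nodes are reasoning steps, edges are sequential or semantic."""

    def __init__(self, semantic_threshold=0.5):
        self.steps = []      # node id -> step text
        self.vectors = []    # node id -> embedding
        self.edges = {}      # node id -> set of neighbour ids
        self.semantic_threshold = semantic_threshold

    def add_template(self, steps):
        """Store one solved problem's steps, linking them to the existing graph."""
        prev = None
        for text in steps:
            node = len(self.steps)
            vec = embed(text)
            self.edges[node] = set()
            # Semantic edges: connect (both ways) to existing steps that look similar.
            for other, other_vec in enumerate(self.vectors):
                if cosine(vec, other_vec) >= self.semantic_threshold:
                    self.edges[node].add(other)
                    self.edges[other].add(node)
            self.steps.append(text)
            self.vectors.append(vec)
            # Sequential edge: previous step of the same template points to this one.
            if prev is not None:
                self.edges[prev].add(node)
            prev = node

    def build_template(self, query, max_steps=4):
        """Retrieve the best-matching node, then greedily follow the highest-reward neighbour."""
        if not self.steps:
            return []
        q = embed(query)

        def reward(n):
            # Stand-in reward: similarity between the query and the candidate step.
            return cosine(q, self.vectors[n])

        current = max(range(len(self.steps)), key=reward)
        path = [current]
        while len(path) < max_steps:
            candidates = sorted(n for n in self.edges[current] if n not in path)
            if not candidates:
                break
            current = max(candidates, key=reward)
            path.append(current)
        return [self.steps[n] for n in path]


graph = ThoughtGraph()
graph.add_template([
    "let x be the unknown side length",
    "apply the pythagorean theorem to the triangle",
    "solve the quadratic equation for x",
])
graph.add_template([
    "count the favourable outcomes",
    "divide by the total number of outcomes",
])
template = graph.build_template("let x be the unknown side length")
print("\n".join(template))
```

In this sketch the assembled template would be prepended to the prompt so the model follows the retrieved steps rather than exploring from scratch; how the template is actually injected and how the reward is computed are left to the paper.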

Paper Structure

This paper contains 48 sections, 23 equations, 12 figures, 7 tables, 1 algorithm.

Figures (12)

  • Figure 1: The figure contrasts Chain-of-Thought (CoT) inference in LRMs with our Retrieval-of-Thought (RoT) approach. In CoT (top), models sequentially explore multiple wrong paths, causing inefficiency and high token usage. RoT (bottom) builds on a structured thought graph where reasoning steps are stored as nodes. First, RoT retrieves relevant nodes and performs reward-guided traversal to assemble a problem-specific template, reducing redundant exploration and directing the model toward correct reasoning. This yields more efficient inference and fewer tokens.
  • Figure 2: Key observations motivating the RoT framework. (Left) Semantic similarity between Thought Store entries and the steps needed to solve problems in reasoning datasets. (Middle) Comparison of retrieval and generation, showing that retrieval is faster than generation. (Right) Thought Caching yields token savings.
  • Figure 3: Average accuracy versus output tokens across Qwen3 models (1.7B, 4B, 8B). Each panel reports CoT, CoT-SC, RAG, BoT, RoT, and RoT+TI (star). The shaded top-left region denotes the Efficient Reasoning Zone, corresponding to higher accuracy with fewer tokens. RoT+TI consistently lies within this region, matching the accuracy of other methods while using substantially fewer tokens.
  • Figure 4: Average per-sample inference cost (USD) across Qwen3 models, comparing CoT, CoT-SC, RAG, BoT, RoT, and RoT+TI. Costs are computed using Alibaba Cloud prices with average input/output token counts over AIME 2023/2024/2025 and AMC 2023. Arrows above RoT+TI indicate the percent cost reduction relative to CoT.
  • Figure 5: Experimental results of RoT and RoT+TI on the GPQA scientific-reasoning benchmark.
  • ...and 7 more figures
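
The per-sample cost in Figure 4 is described as a combination of token prices and average input/output token counts. A minimal sketch of that arithmetic follows, assuming per-million-token pricing; the function names are hypothetical, and the prices and token counts in the usage lines are placeholders rather than Alibaba Cloud prices or values from the paper.

```python
def per_sample_cost(input_tokens, output_tokens, price_in_per_million, price_out_per_million):
    """Cost in USD for one sample, given prices quoted per one million tokens."""
    return (input_tokens * price_in_per_million + output_tokens * price_out_per_million) / 1_000_000


def percent_reduction(baseline_cost, new_cost):
    """Percent cost reduction of new_cost relative to baseline_cost."""
    return 100.0 * (baseline_cost - new_cost) / baseline_cost


# Placeholder numbers only (not from the paper): a longer prompt but far fewer output tokens.
baseline = per_sample_cost(200, 10_000, price_in_per_million=1.0, price_out_per_million=2.0)
with_template = per_sample_cost(600, 6_000, price_in_per_million=1.0, price_out_per_million=2.0)
print(f"reduction: {percent_reduction(baseline, with_template):.1f}%")
```

Because output tokens are typically priced higher than input tokens, a small increase in prompt length can be outweighed by a large drop in generated tokens, which is the trade-off the abstract reports.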

Theorems & Definitions (2)

  • Definition 1: Templates and Steps
  • Definition 2: Thought Graph