Procedural Knowledge at Scale Improves Reasoning

Di Wu, Devendra Singh Sachan, Wen-tau Yih, Mingda Chen

Abstract

Test-time scaling has emerged as an effective way to improve language models on challenging reasoning tasks. However, most existing methods treat each problem in isolation and do not systematically reuse knowledge from prior reasoning trajectories. In particular, they underutilize procedural knowledge: how to reframe a problem, choose an approach, and verify or backtrack when needed. We introduce Reasoning Memory, a retrieval-augmented generation (RAG) framework for reasoning models that explicitly retrieves and reuses procedural knowledge at scale. Starting from existing corpora of step-by-step reasoning trajectories, we decompose each trajectory into self-contained subquestion-subroutine pairs, yielding a datastore of 32 million compact procedural knowledge entries. At inference time, a lightweight in-thought prompt lets the model verbalize the core subquestion, retrieve relevant subroutines within its reasoning trace, and reason under diverse retrieved subroutines as implicit procedural priors. Across six math, science, and coding benchmarks, Reasoning Memory consistently outperforms RAG with document, trajectory, and template knowledge, as well as a compute-matched test-time scaling baseline. With a higher inference budget, it improves over no retrieval by up to 19.2% and over the strongest compute-matched baseline by 7.9% across task types. Ablation studies show that these gains come from two key factors: the broad procedural coverage of the source trajectories and our decomposition and retrieval design, which together enable effective extraction and reuse of procedural knowledge.
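To make the two stages described in the abstract concrete, here is a minimal sketch: an offline pass that decomposes reasoning trajectories into subquestion-subroutine entries, and an online retrieval step that fetches subroutines matching a subquestion the model has verbalized inside its reasoning trace. Everything here is illustrative, not the authors' implementation: the names (`Entry`, `build_datastore`, `retrieve`) are hypothetical, and the bag-of-words embedding stands in for whatever retriever the actual 32-million-entry datastore uses.

```python
# Minimal sketch of the Reasoning Memory pipeline, under the assumptions above.
from dataclasses import dataclass

import numpy as np


@dataclass
class Entry:
    subquestion: str   # self-contained subquestion extracted from a trajectory
    subroutine: str    # the reasoning steps that resolve it


def embed(text: str, dim: int = 256) -> np.ndarray:
    """Placeholder embedding: hash tokens into a normalized bag-of-words vector.
    A real datastore would use a proper (likely dense) retriever instead."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v


def build_datastore(
    trajectories: list[list[tuple[str, str]]],
) -> tuple[list[Entry], np.ndarray]:
    """Offline: flatten trajectories into subquestion-subroutine pairs and index them."""
    entries = [Entry(sq, sr) for traj in trajectories for sq, sr in traj]
    matrix = np.stack([embed(e.subquestion) for e in entries])
    return entries, matrix


def retrieve(
    subquestion: str, entries: list[Entry], matrix: np.ndarray, k: int = 3
) -> list[Entry]:
    """Online: return the k subroutines whose subquestions best match the
    subquestion the model verbalized mid-reasoning."""
    scores = matrix @ embed(subquestion)
    return [entries[i] for i in np.argsort(-scores)[:k]]
```

Note that, per the abstract, retrieval is triggered *within* the reasoning trace (in-thought) rather than prepended to the prompt, so the retrieved subroutines act as procedural priors at exactly the step where the model has framed its current subproblem.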


Figures (10)

  • Figure 1: Illustration of the Reasoning Memory framework. (a) We extract self-contained procedural knowledge from diverse public reasoning trajectories to construct a datastore. (b) At inference time, reasoning models retrieve and reuse relevant procedures within the reasoning trace, enabling test-time scaling through in-thought procedural retrieval.
  • Figure 2: Standard document RAG benefits instruction-tuned models more than reasoning models. Under the CompactDS pipeline, instruction-tuned models obtain modest gains from retrieval, whereas the corresponding reasoning models often see limited gains or even degradation, despite much stronger no-retrieval performance.
  • Figure 3: Performance as a function of inference budget. We compare a no-retrieval Length Scaling baseline against two Reasoning Memory variants on DeepSeek-R1-Distill-Llama-8B as the total sampling budget $m$ increases.
  • Figure 4: Effect of datastore size and composition on Reasoning Memory. Performance of DeepSeek-R1-Distill-Llama-8B with budget $m = 30$. Larger and more diverse datastores generally yield stronger performance.
  • Figure 5: Utility of different types of synthesized knowledge. We report performance gains relative to the no-retrieval setting. Both factual and procedural knowledge help, but procedural knowledge yields larger gains across tasks and model families.
  • ...and 5 more figures