Guideline Forest: Retrieval-Augmented Reasoning with Branching Experience-Induced Guidelines

Jiaxiang Chen, Zhuo Wang, Mingxi Zou, Qifan Wang, Zenglin Xu

TL;DR

Guideline Forest introduces a memory-augmented, retrieval-guided reasoning framework that stores verified gold reasoning traces as reusable experience and induces structured guidelines to steer multi-step problem solving. By retrieving relevant reasoning trajectories and executing multiple guideline-driven branches with stepwise aggregation (and optional multi-model collaboration), it achieves robust, scalable reasoning across math and code benchmarks, outperforming strong baselines such as CoT, ReAct, ToT, FoT, and AFlow. Ablation studies confirm the importance of selective retrieval, path diversity, and early-step aggregation, while demonstrations show the approach generalizes to enhance diverse reasoning methods and enables cross-model collaboration. The work suggests a practical path toward more transparent, cooperative, and efficient reasoning in large language models, with potential impact on complex real-world problem solving.

Abstract

Retrieval-augmented generation (RAG) has been widely adopted to ground large language models (LLMs) in external knowledge, yet it remains largely underexplored for improving reasoning. Existing methods either rely on online exploration during inference or heuristic supervision over reasoning trajectories, but they fail to effectively accumulate and reuse past reasoning experience. We propose Guideline Forest, a retrieval-augmented reasoning framework that explicitly leverages experience to guide multi-step reasoning. The framework stores high-quality, label-consistent reasoning traces as reusable memory, retrieves relevant experiences for new problems, and induces them into structured guidelines that steer reasoning and enable controlled branching and aggregation. Experiments on mathematical (GSM8K, MATH-500) and programming (MBPP, HumanEval) benchmarks demonstrate consistent improvements over strong reasoning baselines, including CoT, ReAct, ToT, FoT, and AFlow. Further analyses show that experience retrieval, guideline-induced diversity, and stepwise aggregation are key to the framework's effectiveness. Beyond single-model reasoning, Guideline Forest generalizes to enhance diverse reasoning paradigms and supports multi-model collaboration, highlighting its flexibility and scalability.
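The store-and-retrieve part of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `GoldTrace` and `ExperienceMemory`, the exact-match check for label consistency, and the bag-of-words cosine similarity are all assumptions made for illustration. A real system would more likely use a learned embedding model for retrieval.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class GoldTrace:
    """Hypothetical record for one stored reasoning trace."""
    question: str
    reasoning: str
    answer: str


def _bow(text: str) -> Counter:
    # Bag-of-words vector; an assumed stand-in for a real text embedding.
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ExperienceMemory:
    """Keeps only label-consistent traces and retrieves the most similar ones."""

    def __init__(self) -> None:
        self._traces: List[GoldTrace] = []

    def add(self, trace: GoldTrace, label: str) -> bool:
        # Assumption: "label-consistent" means the final answer matches the label.
        if trace.answer.strip() == label.strip():
            self._traces.append(trace)
            return True
        return False

    def retrieve(self, question: str, k: int = 2) -> List[GoldTrace]:
        # Rank stored traces by similarity between their question and the new one.
        query = _bow(question)
        ranked = sorted(
            self._traces,
            key=lambda t: _cosine(query, _bow(t.question)),
            reverse=True,
        )
        return ranked[:k]


memory = ExperienceMemory()
memory.add(GoldTrace("What is 3 plus 4?", "Add 3 and 4.", "7"), label="7")
memory.add(GoldTrace("What is 2 times 5?", "Multiply 2 by 5.", "11"), label="10")  # rejected
memory.add(GoldTrace("Reverse a string in Python", "Use slicing.", "s[::-1]"), label="s[::-1]")
nearest = memory.retrieve("What is 10 plus 5?", k=1)
```

In this toy run the second trace is discarded because its answer disagrees with the label, and the addition question is retrieved for the new addition problem. The retrieved traces would then be passed to the model to induce guidelines, a step this sketch leaves out.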

Paper Structure

This paper contains 36 sections, 9 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: A conceptual illustration of our approach. The agent is trained with question–label pairs to produce high-quality, label-consistent gold reasoning stored in memory. When facing a new question, it retrieves relevant gold reasoning, induces guidelines, and solves the problem through stepwise branch reasoning.
  • Figure 2: Overall framework of Guideline Forest. (Left) During training, the model learns to produce high-quality, label-consistent gold reasoning traces through iterative correction: first via chain-of-thought (CoT), then with label guidance, structured exploration (ToT), and, when necessary, memory-based guideline usage. Verified trajectories are stored in a memory repository, forming a growing collection of reusable reasoning experience. (Right) During inference, the system retrieves relevant gold reasoning traces, induces multi-branch guidelines, and executes them in parallel, selecting the best branch at each step to refine the reasoning and yield a higher-quality answer.
  • Figure 3: Relationship between training accuracy and the different stages of iteration.
  • Figure 4: Ablation studies on the MATH-500 dataset evaluating the effects of (a) positive sample count, (b) reasoning path count, (c) self-refinement operation, and (d) aggregation strategies.
  • Figure 5: Comparative experiments highlighting the impact of guidelines on different models and the additional benefit from model collaboration.
  • ...and 1 more figure
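The inference loop described in the Figure 2 caption (several guideline-driven branches, with the best candidate selected at each step) can be sketched as follows. This is an assumed simplification, not the paper's algorithm: `Branch` and `Scorer` are hypothetical callables standing in for guideline-conditioned model calls and for whatever selection criterion the framework uses, and the branches run sequentially here rather than in parallel.

```python
from typing import Callable, List, Sequence

# Hypothetical types: a branch proposes the next step given the shared partial
# reasoning; a scorer rates a candidate step in that context.
Branch = Callable[[List[str]], str]
Scorer = Callable[[List[str], str], float]


def stepwise_branch_reasoning(
    branches: Sequence[Branch], score: Scorer, max_steps: int
) -> List[str]:
    """At every step, collect one candidate per branch and keep the best one."""
    state: List[str] = []
    for _ in range(max_steps):
        candidates = [branch(state) for branch in branches]
        best = max(candidates, key=lambda c: score(state, c))
        # All branches continue from the same aggregated state.
        state.append(best)
    return state


# Toy stand-ins for guideline-conditioned model calls (assumptions).
def terse_branch(state: List[str]) -> str:
    return f"step {len(state) + 1}"


def detailed_branch(state: List[str]) -> str:
    return f"step {len(state) + 1}: with justification"


def prefer_detail(state: List[str], candidate: str) -> float:
    # Toy scorer that rewards longer candidates.
    return float(len(candidate))


trace = stepwise_branch_reasoning([terse_branch, detailed_branch], prefer_detail, max_steps=3)
```

Because the selected step is appended to a shared state before the next round, every branch builds on the same aggregated prefix, which is one plausible reading of "selecting the best at each step".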