Can We Further Elicit Reasoning in LLMs? Critic-Guided Planning with Retrieval-Augmentation for Solving Challenging Tasks

Xingxuan Li; Weiwen Xu; Ruochen Zhao; Fangkai Jiao; Shafiq Joty; Lidong Bing

TL;DR

This paper tackles the difficulty of solving complex reasoning problems that require substantial domain knowledge by proposing CR-Planner, a critic-guided planning framework that coordinates reasoning steps with retrieval augmentation. It uses two fine-tuned critic models to guide sub-goal selection and execution, and employs Monte Carlo Tree Search to collect training data for these critics. Across domains including competitive programming, theorem-based math, and domain retrieval, CR-Planner achieves substantial improvements over strong baselines, often outperforming retrieval-augmented and chain-of-thought approaches. The findings demonstrate the value of explicit planning and critic-driven evaluation in enabling LLMs to perform more reliably on challenging tasks and reveal the framework's flexibility across different base models.

Abstract

State-of-the-art large language models (LLMs) exhibit impressive problem-solving capabilities but may struggle with complex reasoning and factual correctness. Existing methods harness the strengths of chain-of-thought and retrieval-augmented generation (RAG) to decompose a complex problem into simpler steps and apply retrieval to improve factual correctness. These methods work well on straightforward reasoning tasks but often falter on challenging tasks such as competitive programming and mathematics, due to frequent reasoning errors and irrelevant knowledge retrieval. To address this, we introduce Critic-guided planning with Retrieval-augmentation, CR-Planner, a novel framework that leverages fine-tuned critic models to guide both reasoning and retrieval processes through planning. CR-Planner solves a problem by iteratively selecting and executing sub-goals. Initially, it identifies the most promising sub-goal from reasoning, query generation, and retrieval, guided by rewards given by a critic model named sub-goal critic. It then executes this sub-goal through sampling and selecting the optimal output based on evaluations from another critic model named execution critic. This iterative process, informed by retrieved information and critic models, enables CR-Planner to effectively navigate the solution space towards the final answer. We employ Monte Carlo Tree Search to collect the data for training the critic models, allowing for a systematic exploration of action sequences and their long-term impacts. We validate CR-Planner on challenging domain-knowledge-intensive and reasoning-heavy tasks, including competitive programming, theorem-driven math reasoning, and complex domain retrieval problems. Our experiments demonstrate that CR-Planner significantly outperforms baselines, highlighting its effectiveness in addressing challenging problems by improving both reasoning and retrieval.
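
The iterative select-then-execute loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `plan`, the `State` representation, the callable signatures, the `is_done` termination predicate, and the `n_samples` and `max_steps` parameters are all hypothetical names introduced here. Only the three sub-goal types and the two critic roles come from the abstract.

```python
from typing import Callable, List, Tuple

# Sub-goal types named in the abstract.
SUB_GOALS = ("Reason", "GenQuery", "Retrieve")

# Hypothetical state representation: the (sub_goal, chosen_output) pairs so far.
State = List[Tuple[str, str]]


def plan(
    problem: str,
    subgoal_critic: Callable[[str, State, str], float],
    sample_outputs: Callable[[str, State, str, int], List[str]],
    execution_critic: Callable[[str, State, str, str], float],
    is_done: Callable[[State], bool],
    n_samples: int = 4,
    max_steps: int = 10,
) -> State:
    """Iteratively select a sub-goal, then execute it, guided by two critics."""
    state: State = []
    for _ in range(max_steps):
        # Assumed termination check; the abstract does not specify the condition.
        if is_done(state):
            break
        # Step 1: the sub-goal critic scores each sub-goal type; keep the best.
        sub_goal = max(SUB_GOALS, key=lambda g: subgoal_critic(problem, state, g))
        # Step 2: sample several candidate outputs for the chosen sub-goal.
        candidates = sample_outputs(problem, state, sub_goal, n_samples)
        # Step 3: the execution critic scores the candidates; keep the best.
        best = max(
            candidates,
            key=lambda c: execution_critic(problem, state, sub_goal, c),
        )
        # Append to a new list so earlier states are never mutated.
        state = state + [(sub_goal, best)]
    return state
```

In practice the two critics would be the fine-tuned reward models and `sample_outputs` would call an LLM or a retriever; here they are plain callables so the control flow can be exercised with toy stand-ins.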

Paper Structure (49 sections, 3 equations, 4 figures, 6 tables)

Introduction
Critic-Guided Planning with Retrieval-Augmentation
Problem Formulation
Inference of CR-Planner
Action selection using the critic models.
State transition with the selected action.
Termination conditions and the final answer.
The Critic Models
Collecting data via MCTS.
Training.
Experiments
Setup
Models.
Baselines.
Competitive Programming
...and 34 more sections

Figures (4)

Figure 1: Comparison between (a) chain-of-thought reasoning [wei2023chainofthought] with retrieval-augmented generation [lewis2020retrieval] and (b) critic-guided planning with retrieval-augmentation, or CR-Planner (this work). $g(\cdot)$ denotes the critic model (or value function) that assigns a reward (or value) to an action (see Equation \ref{eq:reward}). Text in (b) highlighted in green marks the action selected at each step. For brevity, only pivotal steps are shown in the figure.
Figure 2: The retrieval-augmented and critic-guided planning (CR-Planner) framework. The figure illustrates training data collection via MCTS, critic model training, and inference. For succinct presentation, SubGoal observations (Reason, GenQuery, and Retrieve) are shown as labeled rectangles and Execution observations (Rationale, Query, and Doc) as labeled circles. A state $s_t$ includes all preceding nodes (observations) and arrows (actions) up to the last node.
Figure 3: Performance of different critic models vs. the baseline.
Figure 4: Performance with various execution sampling.
