Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning

Zhihao Dou, Qinjian Zhao, Zhongwei Wan, Dinggen Zhang, Weida Wang, Towsif Raiyan, Benteng Chen, Qingtao Pan, Yang Ouyang, Zhiqiang Gao, Shufei Zhang, Sumon Biswas

TL;DR

This work tackles the lack of global planning in LLM reasoning by introducing Plan-Then-Action Enhanced Reasoning with Group Relative Policy Optimization (PTA-GRPO), a two-stage framework that first builds high-level analytic plans via planning-structured SFT and then refines planning and reasoning with guidance-aware RL. The PSR-CS stage creates an analytically guided dataset and initializes the policy through supervised fine-tuning, while the PSG-RL stage extends GRPO with a composite reward that evaluates the quality of the high-level plan, the final outcome, and the output format. The approach yields consistent improvements on mathematical reasoning benchmarks across multiple base models, with larger gains for weaker models and robust gains for stronger ones, supporting the importance of explicit planning in LLM reasoning. Theoretical analysis shows that optimizing the analytic plan increases the mutual information between the predicted and true answers, which reduces the error probability and enables more reliable global planning in CoT. Overall, PTA-GRPO offers a generalizable method for enhancing internal planning and reasoning in LLMs, with practical impact on complex problem-solving tasks.

Abstract

Large language models (LLMs) have demonstrated remarkable reasoning abilities in complex tasks, often relying on Chain-of-Thought (CoT) reasoning. However, due to their autoregressive token-level generation, the reasoning process is largely constrained to local decision-making and lacks global planning. This limitation frequently results in redundant, incoherent, or inaccurate reasoning, which significantly degrades overall performance. Existing approaches, such as tree-based algorithms and reinforcement learning (RL), attempt to address this issue but suffer from high computational costs and often fail to produce optimal reasoning trajectories. To tackle this challenge, we propose Plan-Then-Action Enhanced Reasoning with Group Relative Policy Optimization (PTA-GRPO), a two-stage framework designed to improve both high-level planning and fine-grained CoT reasoning. In the first stage, we leverage advanced LLMs to distill CoT into compact high-level guidance, which is then used for supervised fine-tuning (SFT). In the second stage, we introduce a guidance-aware RL method that jointly optimizes the final output and the quality of high-level guidance, thereby enhancing reasoning effectiveness. We conduct extensive experiments on multiple mathematical reasoning benchmarks, including MATH, AIME2024, AIME2025, and AMC, across diverse base models such as Qwen2.5-7B-Instruct, Qwen3-8B, Qwen3-14B, and LLaMA3.2-3B. Experimental results demonstrate that PTA-GRPO consistently achieves stable and significant improvements across different models and tasks, validating its effectiveness and generalization.
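
The second stage combines several reward signals and feeds them into a group-relative update. A minimal sketch of that idea follows, assuming a simple weighted sum of three terms (plan quality, answer correctness, format) and the usual GRPO-style standardization of rewards within a group of rollouts. The `Rollout` type, the field names, and the weights `w_plan`, `w_outcome`, and `w_format` are hypothetical and chosen only for illustration; the summary does not state how the terms are actually combined.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    plan_score: float      # hypothetical plan-quality score in [0, 1]
    correct: bool          # final answer matches the reference answer
    well_formatted: bool   # output follows the expected format


def composite_reward(r, w_plan=0.5, w_outcome=1.0, w_format=0.1):
    """Weighted sum of the three reward terms (weights are illustrative)."""
    return (w_plan * r.plan_score
            + w_outcome * float(r.correct)
            + w_format * float(r.well_formatted))


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(x - mu) / (sigma + eps) for x in rewards]


# One group of rollouts sampled for the same question.
group = [
    Rollout(plan_score=0.9, correct=True, well_formatted=True),
    Rollout(plan_score=0.4, correct=False, well_formatted=True),
    Rollout(plan_score=0.7, correct=True, well_formatted=False),
    Rollout(plan_score=0.1, correct=False, well_formatted=False),
]
rewards = [composite_reward(r) for r in group]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group, a rollout is pushed up only when its combined reward exceeds that of its siblings for the same question, so no separate value model is needed.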

Paper Structure

This paper contains 24 sections, 2 theorems, 35 equations, 5 figures, 4 tables.

Key Result

Theorem 3.1

Let $q$ denote the input question, $t$ the analytic plan, $\hat{y}$ the answer predicted by the policy model, and $y$ the ground-truth answer. With error probability $p_{\text{error}}$, it holds that: where $H(\cdot)$ denotes the entropy, and $I(\cdot)$ denotes the mutual information.
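
For context, bounds that tie an error probability to entropy and mutual information are classically obtained from Fano's inequality. The block below states that standard result in the notation above; the finite answer set $\mathcal{Y}$ is an assumed symbol, conditioning on $q$ and $t$ is omitted for brevity, and this is not necessarily the exact statement of Theorem 3.1.

$$
H(y \mid \hat{y}) \;\le\; H_b(p_{\text{error}}) + p_{\text{error}} \log\bigl(|\mathcal{Y}| - 1\bigr),
\qquad p_{\text{error}} = \Pr(\hat{y} \ne y).
$$

Using $H(y \mid \hat{y}) = H(y) - I(y; \hat{y})$, $H_b(\cdot) \le \log 2$, and $\log(|\mathcal{Y}| - 1) \le \log |\mathcal{Y}|$, this rearranges to

$$
p_{\text{error}} \;\ge\; \frac{H(y) - I(y; \hat{y}) - \log 2}{\log |\mathcal{Y}|}.
$$

Read this way, a larger mutual information $I(y; \hat{y})$ lowers the floor on the achievable error probability, which matches the role the summary assigns to the analytic plan.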

Figures (5)

  • Figure 1: (a) GRPO reasoning process. (b) PTA-GRPO reasoning process. (c) Impact of the analytic plan: accuracy of different reasoning modes with Qwen2.5-7B-Instruct as the base model. Yellow indicates the base model using CoT reasoning, blue indicates the base model reasoning with its own self-generated analytic plan, and green indicates the base model reasoning with an analytic plan generated by GPT-o1. More test cases of PTA-GRPO are shown in the appendix.
  • Figure 2: Comparison between GRPO and PTA-GRPO. It is worth noting that, to ensure a fair comparison, the number of rollout responses is kept the same between GRPO and PTA-GRPO.
  • Figure 3: Effect of scaling test-time compute on AIME25 (Pass@K), with Qwen2.5-7B-Instruct as the base model.
  • Figure 4: Training Dynamics of PTA-GRPO with Qwen3-8B.
  • Figure 5: Training Dynamics of PTA-GRPO with Qwen2.5-7B-Instruct.
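
Figure 3 reports Pass@K. The summary does not say how the metric was computed; a common choice, shown here as a sketch, is the unbiased estimator that draws `n` samples per problem, counts the `c` correct ones, and estimates the probability that at least one of `k` randomly chosen samples is correct. The function name `pass_at_k` and its argument names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no correct sample.
    # math.comb returns 0 when k > n - c, so the estimate is then 1.0.
    p_all_wrong = comb(n - c, k) / comb(n, k)
    return 1.0 - p_all_wrong


# Per-problem estimates are averaged over the benchmark.
per_problem = [pass_at_k(4, 1, 2), pass_at_k(4, 0, 2), pass_at_k(4, 4, 2)]
benchmark_score = sum(per_problem) / len(per_problem)
```

Averaging the per-problem estimates gives the benchmark-level score; increasing `k` can only raise it, which is why such curves are used to show the effect of scaling test-time compute.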

Theorems & Definitions (5)

  • Theorem 3.1
  • Remark 3.2
  • proof
  • Lemma A.1
  • proof: Proof by induction