Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective

Siwei Wang, Yifei Shen, Haoran Sun, Shi Feng, Shang-Hua Teng, Li Dong, Yaru Hao, Wei Chen

TL;DR

This work investigates RL's benefits and limitations through a tractable graph-based abstraction, focusing on policy gradient (PG) and Q-learning methods. It shows that supervised fine-tuning may introduce co-occurrence-based spurious solutions, whereas RL achieves correct planning primarily through exploration.

Abstract

Recent reinforcement learning (RL) methods have substantially enhanced the planning capabilities of Large Language Models (LLMs), yet the theoretical basis for their effectiveness remains elusive. In this work, we investigate RL's benefits and limitations through a tractable graph-based abstraction, focusing on policy gradient (PG) and Q-learning methods. Our theoretical analyses reveal that supervised fine-tuning (SFT) may introduce co-occurrence-based spurious solutions, whereas RL achieves correct planning primarily through exploration, underscoring exploration's role in enabling better generalization. However, we also show that PG suffers from diversity collapse: output diversity decreases during training, and this decline persists even after perfect accuracy is attained. By contrast, Q-learning provides two key advantages: off-policy learning and diversity preservation at convergence. We further demonstrate that careful reward design is necessary to prevent Q-value bias in Q-learning. Finally, applying our framework to the real-world planning benchmark Blocksworld, we confirm that these behaviors manifest in practice.

Paper Structure

This paper contains 40 sections, 9 theorems, 58 equations, 8 figures, 1 algorithm.

Key Result

Theorem 3.1

Assume that Assumption asmp:3d holds. Let $N_{u_{\text{target}}, u_m, k}$ denote the number of occurrences in the training dataset where the target node is $u_{\text{target}}$, the current node is $u_m$, and the next node is $k$. The optimal solution of SFT satisfies: if $\sum_{k'} N_{u_{\text{target}}, u_m, k'} = 0$, the output can be any probability distribution.
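The counts in this statement can be made concrete with a small sketch. It tallies $N_{u_{\text{target}}, u_m, k}$ from a toy set of training paths and returns a next-node distribution for each (target, current) pair. The statement above only covers the zero-count case; for nonzero counts the sketch assumes the standard maximum-likelihood form (counts normalized to sum to one), which is an assumption here rather than something quoted from the theorem. The function name `sft_count_solution` and the path layout are hypothetical.

```python
from collections import Counter, defaultdict


def sft_count_solution(paths):
    """Build a count-based next-node predictor from training paths.

    `paths` is an iterable of (target, nodes) pairs, where `nodes` is the
    node sequence of one training path (hypothetical data layout).
    """
    # counts[(target, current)][k] plays the role of N_{target, current, k}.
    counts = defaultdict(Counter)
    for target, nodes in paths:
        for current, nxt in zip(nodes, nodes[1:]):
            counts[(target, current)][nxt] += 1

    def next_node_distribution(target, current):
        row = counts.get((target, current))
        if not row:
            # Zero total count: the training data places no constraint,
            # so any distribution is optimal. None signals that case.
            return None
        total = sum(row.values())
        # Assumed nonzero-count case: normalized co-occurrence counts.
        return {k: n / total for k, n in row.items()}

    return next_node_distribution


# Toy example: three paths from node 0 to target node 3.
toy_paths = [(3, [0, 1, 3]), (3, [0, 2, 3]), (3, [0, 1, 3])]
predict = sft_count_solution(toy_paths)
print(predict(3, 0))  # distribution over next nodes 1 and 2
print(predict(4, 0))  # unseen (target, current) pair
```

Under this reading, the predictor depends only on how often a next node co-occurs with a (target, current) pair in the data, which is the sense in which the abstract describes SFT solutions as co-occurrence-based.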

Figures (8)

  • Figure 1: Frequency of edge occurrences in the SFT training data $\mathcal{D}^{\text{SFT}}$ and the adjacency structures learned by different models. The underlying graph represents transitions between block configurations in Blocksworld (Valmeekam et al., 2023).
  • Figure 2: Empirical results of PG training. Both PG and continual SFT are initialized from the same base model. Figures (a)-(c) illustrate the training dynamics of test accuracy (under greedy decoding), training accuracy (under temperature sampling), and response diversity (under temperature sampling). Figure (d) shows how different KL regularization strengths affect the final models.
  • Figure 3: Empirical comparison between Q-learning and PG. Figure (a) shows the training dynamics of training and test accuracy (under greedy decoding). Figure (b) compares the Pareto frontiers of output diversity and accuracy on the training and test sets (under temperature decoding).
  • Figure 4: Heatmap of normalized logits from the Q-learning model with process reward. For each row $i$, green blocks indicate valid next nodes given the current node $0$ and target node $i$. The logits corresponding to these valid actions consistently increase during training.
  • Figure 5: Empirical validation that the trained one-layer one-head transformer acts as a function of the target node and the current node. The visualization of attention maps across SFT, PG, and Q-learning training shows a consistent, strong focus on the target node (token position 1).
  • ...and 3 more figures

Theorems & Definitions (18)

  • Theorem 3.1: Optimal Solution of SFT
  • Theorem 4.1: Connections between PG and SFT
  • Theorem 4.2: Convergence of PG without KL regularization
  • Theorem 4.3: Diversity Collapse of PG without KL regularization
  • Theorem 4.4: The effect of KL regularization
  • Lemma 5.1
  • Theorem 5.1: Stable points of outcome reward
  • Theorem 5.2: Stable points of process reward
  • Theorem 5.3: Stable points of process reward
  • proof
  • ...and 8 more