Chasing Progress, Not Perfection: Revisiting Strategies for End-to-End LLM Plan Generation

Sukai Huang, Trevor Cohn, Nir Lipovetzky

TL;DR

The paper tackles the debate over LLM-based planning by building an end-to-end planner and evaluating it with broad, multi-faceted metrics, including in-domain and out-of-distribution performance. It demonstrates that fine-tuning solely on planning data yields weak OOD generalization, while Chain-of-Thought prompts improve executability but not final validity; reinforcement learning with a novel Longest Contiguous Common Subsequence (LCCS) reward emerges as the most effective approach for both validity and executability on longer tasks. The authors introduce an Extended PlanBench with longer plans and obfuscated domains to stress-test planning capabilities, revealing substantial generalization gaps and the importance of data diversity. Overall, the work provides nuanced insights into how different reasoning strategies influence planning outcomes and highlights RL with LCCS as a promising direction for improving end-to-end LLM planning, while also outlining the remaining challenges in achieving robust, widely generalizable plan generation.

Abstract

The capability of Large Language Models (LLMs) to plan remains a topic of debate. Some critics argue that strategies to boost LLMs' reasoning skills are ineffective in planning tasks, while others report strong outcomes merely from training models on a planning corpus. This study reassesses recent strategies by developing an end-to-end LLM planner and employing diverse metrics for a thorough evaluation. We find that merely fine-tuning LLMs on a corpus of planning instances does not lead to robust planning skills, as indicated by poor performance on out-of-distribution test sets. At the same time, we find that various strategies, including Chain-of-Thought, do enhance the probability of a plan being executable. This indicates progress towards better plan quality, despite not directly enhancing the final validity rate. Among the strategies we evaluated, reinforcement learning with our novel 'Longest Contiguous Common Subsequence' reward emerged as the most effective, contributing to both plan validity and executability. Overall, our research addresses key misconceptions in the LLM-planning literature; we validate incremental progress in plan executability, although plan validity remains a challenge. Hence, future strategies should focus on both these aspects, drawing insights from our findings.

Paper Structure

This paper contains 33 sections, 1 equation, 17 figures, 5 tables.

Figures (17)

  • Figure 1: (a) Next-token prediction: The LLM is trained to predict the next token in a corpus containing domain and problem details and their plans, proceeding left-to-right. (b) Proposed LCCS Reward: We use the length of the LCCS as an auxiliary reward signal for RL. It provides granular feedback to fill the sparse reward gap. (See the paper's subsection on reinforcement learning.) (c) Reinforcement Learning: RL with LCCS Reward is shown to be the most effective strategy for enhancing end-to-end LLM planners among all tested strategies.
  • Figure 2: Each planning problem instance in PlanBench is serialized into a single text block that presents the context, action description, initial and goal states, and the plan.
  • Figure 2: Ablation Study on Strategy Effectiveness in Planning. The validity rate (valid.) and the executability rate (exec.) are analyzed. Strategies such as Permutation, CoT, and Self-Correct show no significant validity improvements but enhance executability in 'long' and other OOD scenarios. Notably, 'Goal CoT' appears to hinder performance. We attribute this to the dual duties of generating plans and accurately estimating the heuristics of the state, which increases overall complexity and hinders the model's ability to effectively learn both aspects. RL emerges as the only strategy that enhances validity in OOD scenarios. Improvements of statistical significance are highlighted in green, while significant declines are highlighted in red.
  • Figure 3: Permutation augmentation strategy shuffles the order of action descriptions (red arrows), condition and effect descriptions (blue arrows), and atoms in initial and goal statements (green arrows). The model is expected to learn underlying semantics through more diverse data representation, avoiding overfitting to superficial patterns.
  • Figure 4: Two types of CoT prompts are used in the plan response -- Goal CoT and State CoT. Goal CoT prompts the agent to repeat the goal and count the remaining steps to the goal. State CoT prompts the agent to provide grounded conditions for the action and grounded effects. The model is expected to learn the world dynamics through these prompts.
  • ...and 12 more figures
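
The LCCS reward mentioned in the Figure 1 caption can be illustrated with a small sketch. This is not the authors' implementation: it assumes that the LCCS is the longest run of consecutive actions that appears in both the generated plan and a reference plan, at any position, and the normalization by reference length in `lccs_reward` is a hypothetical choice made here only to keep the value in [0, 1]. The function names and the toy Blocksworld-style action strings are also illustrative.

```python
from typing import Sequence


def lccs_length(generated: Sequence[str], reference: Sequence[str]) -> int:
    """Length of the longest run of consecutive actions shared by both plans."""
    best = 0
    # prev[j] holds the length of the common run that ends at the previous
    # generated action and at reference[j - 1].
    prev = [0] * (len(reference) + 1)
    for gen_action in generated:
        curr = [0] * (len(reference) + 1)
        for j, ref_action in enumerate(reference, start=1):
            if gen_action == ref_action:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def lccs_reward(generated: Sequence[str], reference: Sequence[str]) -> float:
    """Hypothetical normalization: LCCS length divided by reference length."""
    if not reference:
        return 0.0
    return lccs_length(generated, reference) / len(reference)


# Toy plans: the generated plan matches the first two reference actions,
# then diverges at the third step.
reference_plan = ["pickup a", "stack a b", "pickup c", "stack c a"]
generated_plan = ["pickup a", "stack a b", "pickup d", "stack c a"]

print(lccs_length(generated_plan, reference_plan))
print(lccs_reward(generated_plan, reference_plan))
```

Unlike a binary valid/invalid signal, a score of this kind still rewards a plan that gets a long stretch of actions right, which is the sense in which the caption describes the reward as granular feedback for a sparse-reward setting.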

Theorems & Definitions (3)

  • Definition A.1: Executability of a Plan
  • Definition A.2: Validity of a Plan
  • Definition G.1: Goal-Satisfiability of a Plan
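
The text of these definitions is not reproduced on this page, so the following is only a sketch of the usual classical-planning reading of the first two: a plan is executable if every action's preconditions hold in the state where it is applied, and valid if it is executable and the final state satisfies the goal. The STRIPS-style `Action` class, the helper names, and the toy Blocksworld atoms are assumptions made for illustration. Definition G.1 is not covered here.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Sequence, Set


@dataclass(frozen=True)
class Action:
    """A grounded STRIPS-style action (illustrative, not the paper's format)."""
    name: str
    preconditions: FrozenSet[str]
    add_effects: FrozenSet[str]
    del_effects: FrozenSet[str]


def execute(initial: Iterable[str], plan: Sequence[Action]) -> Optional[Set[str]]:
    """Apply the plan step by step; return None if a precondition fails."""
    state = set(initial)
    for action in plan:
        if not action.preconditions <= state:
            return None
        state = (state - action.del_effects) | action.add_effects
    return state


def is_executable(initial: Iterable[str], plan: Sequence[Action]) -> bool:
    return execute(initial, plan) is not None


def is_valid(initial: Iterable[str], goal: Iterable[str], plan: Sequence[Action]) -> bool:
    final_state = execute(initial, plan)
    return final_state is not None and set(goal) <= final_state


pickup_a = Action(
    name="pickup a",
    preconditions=frozenset({"clear a", "ontable a", "handempty"}),
    add_effects=frozenset({"holding a"}),
    del_effects=frozenset({"clear a", "ontable a", "handempty"}),
)
stack_a_b = Action(
    name="stack a b",
    preconditions=frozenset({"holding a", "clear b"}),
    add_effects=frozenset({"on a b", "clear a", "handempty"}),
    del_effects=frozenset({"holding a", "clear b"}),
)

initial_state = {"clear a", "ontable a", "clear b", "ontable b", "handempty"}
goal_atoms = {"on a b"}

# Executable but not valid: the plan runs, yet stops short of the goal.
print(is_executable(initial_state, [pickup_a]), is_valid(initial_state, goal_atoms, [pickup_a]))
# Executable and valid.
print(is_valid(initial_state, goal_atoms, [pickup_a, stack_a_b]))
# Not executable: the hand is not holding block a.
print(is_executable(initial_state, [stack_a_b]))
```

Under this reading, validity implies executability but not the reverse, which is why the abstract can report gains in the executability rate without matching gains in the validity rate.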