Chasing Progress, Not Perfection: Revisiting Strategies for End-to-End LLM Plan Generation
Sukai Huang, Trevor Cohn, Nir Lipovetzky
TL;DR
The paper revisits the debate over whether LLMs can plan by building an end-to-end LLM planner and evaluating it with several metrics on both in-domain and out-of-distribution (OOD) test sets. Fine-tuning on planning data alone yields weak OOD generalization. Chain-of-Thought prompting raises the probability that a plan is executable but does not raise the final validity rate. Reinforcement learning with a novel Longest Contiguous Common Subsequence (LCCS) reward is the most effective strategy evaluated, improving both validity and executability on longer tasks. The authors also introduce an Extended PlanBench with longer plans and obfuscated domains to stress-test planning capabilities; it reveals substantial generalization gaps and the importance of data diversity. RL with LCCS is therefore a promising direction for end-to-end LLM planning, although robust, widely generalizable plan generation remains an open challenge.
Abstract
The capability of Large Language Models (LLMs) to plan remains a topic of debate. Some critics argue that strategies to boost LLMs' reasoning skills are ineffective in planning tasks, while others report strong outcomes merely from training models on a planning corpus. This study reassesses recent strategies by developing an end-to-end LLM planner and employing diverse metrics for a thorough evaluation. We find that merely fine-tuning LLMs on a corpus of planning instances does not lead to robust planning skills, as indicated by poor performance on out-of-distribution test sets. At the same time, we find that various strategies, including Chain-of-Thought, do enhance the probability of a plan being executable. This indicates progress towards better plan quality, despite not directly enhancing the final validity rate. Among the strategies we evaluated, reinforcement learning with our novel 'Longest Contiguous Common Subsequence' reward emerged as the most effective, contributing to both plan validity and executability. Overall, our research addresses key misconceptions in the LLM-planning literature; we validate incremental progress in plan executability, although plan validity remains a challenge. Hence, future strategies should focus on both these aspects, drawing insights from our findings.
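The abstract names the 'Longest Contiguous Common Subsequence' reward without defining it. The sketch below shows one plausible reading, in which the reward is the length of the longest run of consecutive actions shared by a generated plan and a reference plan. The function names, the representation of a plan as a list of action strings, and the normalization by reference-plan length are all assumptions made for illustration, not the authors' formulation.

```python
from typing import Sequence


def longest_contiguous_common_run(generated: Sequence[str],
                                  reference: Sequence[str]) -> int:
    """Length of the longest run of consecutive actions found in both plans."""
    best = 0
    # prev[j] holds the length of the common run that ends at the previous
    # generated action and at reference[j - 1].
    prev = [0] * (len(reference) + 1)
    for g_action in generated:
        curr = [0] * (len(reference) + 1)
        for j, r_action in enumerate(reference, start=1):
            if g_action == r_action:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def lccs_reward(generated: Sequence[str], reference: Sequence[str]) -> float:
    """Hypothetical reward in [0, 1]; normalizing by reference length is an assumption."""
    if not reference:
        return 0.0
    return longest_contiguous_common_run(generated, reference) / len(reference)


# Toy Blocksworld-style plans (illustrative data only).
reference_plan = ["pickup a", "stack a b", "pickup c", "stack c a"]
generated_plan = ["pickup a", "stack a b", "pickup d", "stack c a"]
reward = lccs_reward(generated_plan, reference_plan)
```

Under this reading, a plan that matches the reference for a long unbroken stretch scores higher than one that shares the same number of actions scattered across the plan, which gives partial credit for progress even when the plan as a whole is not valid.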
