Make Planning Research Rigorous Again!

Michael Katz, Harsha Kokel, Christian Muise, Shirin Sohrabi, Sarath Sreedharan

TL;DR

The paper addresses the risk that rapid progress in LLM-based planning lacks rigorous, reproducible evaluation. It argues for importing decades of automated planning practices (formal problem formulations, benchmarks, validators, and transparent experiments) into the design and evaluation of LLM-based planners. It contributes a structured overview of planning formalisms, a catalog of common pitfalls, data and tooling guidance, and concrete evaluation best practices to guide researchers and reviewers. By adopting these planning-centered methodologies, the work aims to improve reproducibility, enable meaningful comparisons, and accelerate robust progress in LLM-based planning, with broader benefits to the planning community.

Abstract

In over sixty years since its inception, the field of planning has made significant contributions to both the theory and practice of building planning software that can solve a never-before-seen planning problem. This was done through established practices of rigorous design and evaluation of planning systems. It is our position that this rigor should be applied to the current trend of work on planning with large language models. One way to do so is by correctly incorporating the insights, tools, and data from the automated planning community into the design and evaluation of LLM-based planners. The experience and expertise of the planning community are not just important from a historical perspective; the lessons learned could play a crucial role in accelerating the development of LLM-based planners. This position is particularly important in light of the abundance of recent works that replicate and propagate the same pitfalls that the planning community has encountered and learned from. We believe that avoiding such known pitfalls will contribute greatly to the progress in building LLM-based planners and to planning in general.

Paper Structure

This paper contains 27 sections and 1 figure.

Figures (1)

  • Figure 1: Benchmarks used in the literature for LLM-based Planning evaluations