Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages

Max Zuo, Francisco Piedrahita Velez, Xiaochen Li, Michael L. Littman, Stephen H. Bach

TL;DR

Planetarium introduces a rigorous benchmark for translating natural language planning descriptions into PDDL by defining a formal equivalence concept and implementing a scene-graph–based isomorphism algorithm. It couples this with a large, diverse dataset of text-to-PDDL pairs across Blocks World, Gripper, and Floor Tile to stress test abstraction and scale. Empirical results show that even strong LLMs produce syntactically valid PDDL far more often than semantically correct PDDL, underscoring the need for robust evaluation frameworks and hybrid planning approaches. The work provides code and data releases and outlines limitations and future improvements to broaden domain expressiveness and alignment with real-world planning needs.

Abstract

Recent works have explored using language models for planning problems. One approach examines translating natural language descriptions of planning tasks into structured planning languages, such as the planning domain definition language (PDDL). Existing evaluation methods struggle to ensure semantic correctness and rely on simple or unrealistic datasets. To bridge this gap, we introduce Planetarium, a benchmark designed to evaluate language models' ability to generate PDDL code from natural language descriptions of planning tasks. Planetarium features a novel PDDL equivalence algorithm that flexibly evaluates the correctness of generated PDDL, along with a dataset of 145,918 text-to-PDDL pairs across 73 unique state combinations with varying levels of difficulty. Finally, we evaluate several API-access and open-weight language models that reveal this task's complexity. For example, 96.1% of the PDDL problem descriptions generated by GPT-4o are syntactically parseable, 94.4% are solvable, but only 24.8% are semantically correct, highlighting the need for a more rigorous benchmark for this problem.

Paper Structure (38 sections, 4 theorems, 3 equations, 8 figures, 3 tables, 1 algorithm)

This paper contains 38 sections, 4 theorems, 3 equations, 8 figures, 3 tables, 1 algorithm.

Key Result

theorem 1

Equivalent($P_a$, $P_b$, False) returns True if and only if the PDDL problem files $P_a$ and $P_b$ represent equivalent planning problems under Definition def:equivalence. Equivalent($P_a$, $P_b$, True) returns True if and only if $P_a$ represents a planning problem that is equivalent to some planni

Figures (8)

  • Figure 1: An example of one planning goal corresponding to many correct PDDL goals. All PDDL goals in the top row represent the displayed goal correctly. The bottom row illustrates PDDL goals with different error types, showing instances that are solvable (a planner can generate a plan, but for a different planning problem), parseable (the PDDL syntax is correct but will not produce any plan from a planner), and not parseable (it is not valid PDDL). See the evaluation section (sec:eval) for details.
  • Figure 2: A demonstration of how fullySpecify fills in gaps of the goal state of a planning problem.
  • Figure 3: Performance of various models on the test set.
  • Figure 4: Breakdown of zero-shot performance by domain for Gemma 2 27B IT, GPT-4o, and o1-mini.
  • Figure 5: An illustration of the algorithm to check if two PDDL problems are equivalent. It shows each of the stages of the algorithm: transforming to scene graphs, fully specifying the goal propositions, and checking for graph isomorphism.
  • ...and 3 more figures
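
The stages named in the Figure 5 caption (scene graphs, fully specified goals, graph isomorphism) can be illustrated with a much-simplified sketch. It is not the paper's algorithm: it assumes each problem is a plain dictionary of objects, initial propositions, and goal propositions, it omits the fullySpecify step from Figure 2 entirely, and it replaces graph isomorphism with a brute-force search over object renamings. The function names `rename` and `equivalent_sketch`, the dictionary layout, and the Blocks World predicates are all hypothetical choices made for illustration.

```python
from itertools import permutations


def rename(props, mapping):
    """Apply an object renaming to a set of (predicate, args) propositions."""
    return frozenset(
        (pred, tuple(mapping[arg] for arg in args)) for pred, args in props
    )


def equivalent_sketch(problem_a, problem_b):
    """Return True if some bijection between object names maps problem_a's
    init and goal propositions exactly onto problem_b's.

    Simplification (assumption): goals are compared literally, so two goals
    that differ only in implied propositions are reported as different.
    """
    objs_a = sorted(problem_a["objects"])
    objs_b = sorted(problem_b["objects"])
    if len(objs_a) != len(objs_b):
        return False
    init_b = frozenset(problem_b["init"])
    goal_b = frozenset(problem_b["goal"])
    # Brute force: try every one-to-one renaming of objects.
    for perm in permutations(objs_b):
        mapping = dict(zip(objs_a, perm))
        if (
            rename(problem_a["init"], mapping) == init_b
            and rename(problem_a["goal"], mapping) == goal_b
        ):
            return True
    return False


# Two Blocks World problems that differ only in object names.
p1 = {
    "objects": {"a", "b"},
    "init": {("on-table", ("a",)), ("on-table", ("b",)),
             ("clear", ("a",)), ("clear", ("b",)), ("arm-empty", ())},
    "goal": {("on", ("a", "b"))},
}
p2 = {
    "objects": {"x", "y"},
    "init": {("on-table", ("x",)), ("on-table", ("y",)),
             ("clear", ("x",)), ("clear", ("y",)), ("arm-empty", ())},
    "goal": {("on", ("y", "x"))},
}
# A problem with a different initial state (x already stacked on y).
p3 = {
    "objects": {"x", "y"},
    "init": {("on", ("x", "y")), ("on-table", ("y",)),
             ("clear", ("x",)), ("arm-empty", ())},
    "goal": {("on", ("x", "y"))},
}

same_up_to_renaming = equivalent_sketch(p1, p2)
different_init = equivalent_sketch(p1, p3)
```

Here `p1` and `p2` match under the renaming a→y, b→x, while `p3` does not match `p1` because its initial state differs. The brute-force search is factorial in the number of objects, so it only suits tiny examples; a graph-isomorphism routine over scene graphs, as the caption describes, is the scalable route.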

Theorems & Definitions (10)

  • definition 1
  • definition 2
  • theorem 1
  • proposition 1
  • proof
  • proposition 2
  • proof
  • proof: Proof of Theorem 1 (thm:main)
  • lemma 1
  • proof