Towards a General Framework for HTN Modeling with LLMs

Israel Puerta-Merino, Carlos Núñez-Molina, Pablo Mesejo, Juan Fernández-Olivares

TL;DR

The paper addresses the gap in leveraging LLMs for hierarchical planning (HP) by extending the L2P framework to support HP modeling and evaluation. It introduces L2HP, an extensible framework with HTN data types, parsers, and an NL2HTN pipeline, enabling LLM-driven generation of domain and problem models for HP and supporting exports to PDDL, HPDL, and HDDL. Empirically, the study on PlanBench shows a parsing success rate of around 36% for both AP and HP, but syntactic validity is much lower for HP (about 1%) than for AP (about 20%), highlighting the unique difficulties HP poses for LLMs. Overall, the work provides a practical, reproducible platform for HP research with LLMs and outlines concrete directions, including HP-specific benchmarks, improved parsers, and more robust prompting strategies to improve model quality and usefulness.

Abstract

The use of Large Language Models (LLMs) for generating Automated Planning (AP) models has been widely explored; however, their application to Hierarchical Planning (HP) is still far from reaching the level of sophistication observed in non-hierarchical architectures. In this work, we try to address this gap. We present two main contributions. First, we propose L2HP, an extension of L2P (a library for LLM-driven generation of PDDL models) that supports HP model generation and follows a design philosophy of generality and extensibility. Second, we apply our framework to perform experiments in which we compare the modeling capabilities of LLMs for AP and HP. On the PlanBench dataset, results show that parsing success is limited but comparable in both settings (around 36%), while syntactic validity is substantially lower in the hierarchical case (1% vs. 20% of instances). These findings underscore the unique challenges HP presents for LLMs, highlighting the need for further research to improve the quality of generated HP models.

Paper Structure

This paper contains 24 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Overview of a typical L2HP workflow: (1) The LLM is asked to model a planning task, following a template written in Markdown. (2) The LLM output is parsed to extract the planning elements into structured Python Classes. (3) Using the structured data, the planning model is generated -- i.e., both the domain and task files are created. (4) A symbolic planner is then invoked to generate a plan that solves the given problem.
  • Figure 2: Component Diagram of L2HP. Orange components are native to the L2P library, while blue ones were developed within L2HP. Components with both colors indicate that they were originally part of L2P and have been extended in L2HP.
  • Figure 3: An instance of the PlanBench generation subset.
  • Figure 4: An instance of a standardized task. The task shown is the same as in Figure 3, after being preprocessed to conform to the L2HP standardized dataset structure.
  • Figure 5: Illustrative examples of different types of invalid LLM outputs. Examples (1) to (3) represent the three core categories of identifiable errors, while (4) shows an incorrect output that does not trigger an error assertion. In (1), the phrase "a table called ?t" lacks a proper structure, so the parser is unable to translate it into a valid parameter. In (2), both the object block and the type block are parsable, but a conflict arises when invoking the planner. In (3), no plan can be found because the Stack_block action lacks the (on ?b2 ?b1) effect. In (4), the planner finds a plan; however, it is likely incomplete, as the method Move_to_table_m1 lacks the (unstack_block ?b) subtask. A correct definition is shown in Figure 1.
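
The parsing step in Figure 1 (step 2) and the first error category in Figure 5 can be illustrated with a small sketch. It is not the L2HP API: the line format `- ?name - type`, the `Parameter` class, the `ParseError` exception, and the functions `parse_parameters` and `render_typed_list` are all hypothetical names chosen for illustration. The sketch only shows the general idea of turning structured LLM output into Python objects, rejecting unstructured text, and rendering a PDDL-style typed list.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical format for one parameter line, e.g. "- ?b - block".
# The template syntax actually used by L2HP may differ.
PARAM_RE = re.compile(r"^-\s*(\?[A-Za-z_][\w-]*)\s*-\s*([A-Za-z_][\w-]*)\s*$")


@dataclass(frozen=True)
class Parameter:
    name: str  # variable name, including the leading "?"
    type: str  # declared type name


class ParseError(ValueError):
    """Raised when a line cannot be mapped to a Parameter."""


def parse_parameters(block: str) -> List[Parameter]:
    """Parse a block of parameter lines into structured objects."""
    params = []
    for raw in block.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between entries
        match = PARAM_RE.match(line)
        if match is None:
            # Free text such as "a table called ?t" ends up here.
            raise ParseError(f"unparsable parameter line: {line!r}")
        params.append(Parameter(match.group(1), match.group(2)))
    return params


def render_typed_list(params: List[Parameter]) -> str:
    """Render parameters as a PDDL-style typed list."""
    return " ".join(f"{p.name} - {p.type}" for p in params)


if __name__ == "__main__":
    parsed = parse_parameters("- ?b - block\n- ?t - table")
    print(render_typed_list(parsed))
    try:
        parse_parameters("a table called ?t")
    except ParseError as err:
        print("parse error:", err)
```

A parser of this kind can only catch the first error category. The remaining cases in Figure 5 (type conflicts, unsolvable models, and incomplete methods) pass parsing and surface only when a planner is invoked, or not at all.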