Procedural Knowledge Improves Agentic LLM Workflows

Vincent Hsiao, Mark Roberts, Leslie Smith

TL;DR

This paper tackles the difficulty of planning in agentic LLMs by embedding procedural knowledge, in the form of Hierarchical Task Networks (HTNs), within an agentic LLM workflow (ProcLLM). It formalizes an HTN+MDP framework and demonstrates, across four benchmarks, that hand-coded HTNs significantly boost task success and can let smaller LLMs outperform larger baselines, while LLM-generated HTNs provide smaller and more variable gains. The findings suggest that leveraging procedural knowledge from humans and machines will be a key tool for improving LLM workflows in practice, enabling more reliable planning, faster response times, and better scalability across task complexity.

Abstract

Large language models (LLMs) often struggle when performing agentic tasks without substantial tool support, prompt engineering, or fine-tuning. Despite research showing that domain-dependent, procedural knowledge can dramatically increase planning efficiency, little work evaluates its potential for improving LLM performance on agentic tasks that may require implicit planning. We formalize, implement, and evaluate an agentic LLM workflow that leverages procedural knowledge in the form of a hierarchical task network (HTN). Empirical results of our implementation show that hand-coded HTNs can dramatically improve LLM performance on agentic tasks, and using HTNs can boost a 20b or 70b parameter LLM to outperform a much larger 120b parameter LLM baseline. Furthermore, LLM-created HTNs improve overall performance, though less so. The results suggest that leveraging expertise--from humans, documents, or LLMs--to curate procedural knowledge will become another important tool for improving LLM workflows.
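To make the idea of HTN-guided execution concrete, the following is a minimal sketch, not the paper's ProcLLM implementation. It assumes a depth-first decomposition in which abstract tasks are expanded by methods, primitive tasks are handed to an executor (here a stub standing in for an LLM call), and each primitive result must pass a verification check before it is accepted. All names (`Method`, `decompose`, `fake_llm`, `run_demo`) and the toy travel tasks are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, object]


@dataclass
class Method:
    """One way to decompose an abstract task into ordered subtasks."""
    name: str
    applicable: Callable[[State], bool]
    subtasks: List[str]


def decompose(task, state, methods, execute_primitive, verify, trace):
    """Depth-first HTN decomposition with backtracking over methods.

    Tasks without an entry in `methods` are treated as primitive and are
    delegated to `execute_primitive` (a stand-in for an LLM call).
    """
    if task not in methods:
        result = execute_primitive(task, state)
        if not verify(task, result):
            return False
        state[task] = result
        trace.append(task)
        return True

    for method in methods[task]:
        if not method.applicable(state):
            continue
        saved_state, saved_len = dict(state), len(trace)
        if all(decompose(sub, state, methods, execute_primitive, verify, trace)
               for sub in method.subtasks):
            return True
        # Undo partial progress before trying the next method.
        state.clear()
        state.update(saved_state)
        del trace[saved_len:]
    return False


# Hypothetical toy domain, loosely inspired by a travel-planning task.
METHODS = {
    "plan_trip": [
        Method("by_flight", lambda s: s.get("budget", 0) >= 500,
               ["book_flight", "book_hotel"]),
        Method("by_car", lambda s: True, ["rent_car", "book_hotel"]),
    ]
}


def fake_llm(task, state):
    """Stub executor; a real workflow would prompt an LLM here."""
    return f"done:{task}"


def run_demo(budget, failing=()):
    """Run the toy problem; tasks listed in `failing` never pass verification."""
    trace = []
    ok = decompose(
        "plan_trip", {"budget": budget}, METHODS, fake_llm,
        lambda task, result: task not in failing and result.startswith("done:"),
        trace,
    )
    return ok, trace


print(run_demo(1000))
print(run_demo(1000, failing=("book_flight",)))
```

In this sketch the procedural knowledge lives entirely in `METHODS`: the executor only ever sees one small primitive task at a time, which is the intuition behind letting a smaller model follow a hand-written decomposition.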

Paper Structure

This paper contains 38 sections, 4 equations, 6 figures, 1 table, 1 algorithm.

Figures (6)

  • Figure 1: Top: An in-progress HTN decomposition tree for the Travel Planner problem from the introduction. White boxes indicate abstract tasks, ovals indicate methods that decompose tasks, and gray boxes indicate primitive tasks for the LLM to execute. The dashed box for $\kappa_{1}$ indicates it has passed verify. Bottom: Notional methods for the same problem.
  • Figure 2: System overview, labeled with components of the MDP. Blue boxes denote text files, orange boxes denote LLMs, red boxes denote Python files, and green boxes denote API files (scripts, databases, etc.).
  • Figure 3: Success rates for the BlocksWorld (BW) and Unit Movement (UM) domains, evaluated across increasing problem complexity ($b$ starting blocks and $h$ final stack height for BW, number of units $n$ for UM). Bars show mean success rates, with 95% Wilson confidence intervals.
  • Figure 4: A comparison of average runtime statistics for GPT-oss across different benchmarks (BW - BlocksWorld, TP - Travel Planner, UM - Unit Movement). (See the detailed runtime results figure for full details.)
  • Figure A.1: An example problem instance for the unit movement domain showing the initial starting position (blue circles) and a valid solution (red circles).
  • ...and 1 more figure
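The Figure 3 caption reports 95% Wilson confidence intervals on success rates. As a reference for readers, the standard Wilson score interval for a binomial proportion can be computed as sketched below. The function name `wilson_interval` and the example counts are illustrative assumptions; the paper's own analysis code is not shown in this listing.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximately 95% two-sided interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4 * trials * trials)
    ) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical example: 8 successful runs out of 10.
low, high = wilson_interval(8, 10)
print(f"{low:.3f} to {high:.3f}")
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and remains informative when the observed success rate is 0% or 100%, which matters for benchmarks with few trials per configuration.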