Graph-enhanced Large Language Models in Asynchronous Plan Reasoning

Fangru Lin; Emanuele La Malfa; Valentin Hofmann; Elle Michelle Yang; Anthony Cohn; Janet B. Pierrehumbert

TL;DR

The paper addresses asynchronous planning by formalizing it as a longest-path problem on a DAG and introducing a dedicated benchmark, AsyncHow. It proposes Plan Like a Graph (PLaG), a graph-enhanced prompting technique that improves LLM performance across models and task complexities, achieving state-of-the-art results but revealing persistent degradation as complexity grows. The study provides a formal complexity measure that predicts performance, conducts extensive cross-model experiments, and offers rich analyses (ablations, out-of-distribution probes, qualitative cases) to understand the limits of LLMs as autonomous planning devices. Overall, while PLaG boosts capabilities, the results highlight fundamental scalability limits and motivate further exploration of graph-informed representations and multimodal data for robust autonomous planning.
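To make the longest-path formulation concrete, here is a minimal sketch, not the authors' implementation. It assumes each step has a known duration and a set of prerequisite steps, and that unlimited resources allow independent steps to run in parallel. The function name `optimal_duration`, the step names, and all durations are hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter


def optimal_duration(durations, prerequisites):
    """Length of the longest duration-weighted path through the plan DAG.

    durations: dict mapping step -> time cost
    prerequisites: dict mapping step -> set of steps that must finish first
    """
    graph = {step: prerequisites.get(step, set()) for step in durations}
    # static_order() yields every step after all of its prerequisites.
    order = TopologicalSorter(graph).static_order()
    finish = {}
    for step in order:
        # A step can start once its slowest prerequisite has finished.
        start = max((finish[p] for p in graph[step]), default=0)
        finish[step] = start + durations[step]
    return max(finish.values(), default=0)


# Hypothetical baking plan; the numbers are illustrative, not from the paper.
durations = {"make dough": 10, "preheat oven": 10, "roll dough": 5, "bake": 20}
prerequisites = {
    "roll dough": {"make dough"},
    "bake": {"roll dough", "preheat oven"},
}

asynchronous_time = optimal_duration(durations, prerequisites)
sequential_time = sum(durations.values())
print(asynchronous_time, sequential_time)
```

In this toy plan the asynchronous schedule takes 35 time units against 45 for fully sequential execution, because preheating overlaps with preparing the dough.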

Abstract

Planning is a fundamental property of human intelligence. Reasoning about asynchronous plans is challenging since it requires sequential and parallel planning to optimize time costs. Can large language models (LLMs) succeed at this task? Here, we present the first large-scale study investigating this question. We find that a representative set of closed and open-source LLMs, including GPT-4 and LLaMA-2, behave poorly when not supplied with illustrations about the task-solving process in our benchmark AsyncHow. We propose a novel technique called Plan Like a Graph (PLaG) that combines graphs with natural language prompts and achieves state-of-the-art results. We show that although PLaG can boost model performance, LLMs still suffer from drastic degradation when task complexity increases, highlighting the limits of utilizing LLMs for simulating digital devices. We see our study as an exciting step towards using LLMs as efficient autonomous agents. Our code and data are available at https://github.com/fangru-lin/graph-llm-asynchow-plan.

Paper Structure (50 sections, 1 equation, 15 figures, 8 tables)

Introduction
Preliminaries: Naturalistic Asynchronous Planning
Complexity of Naturalistic Planning
Method: Plan Like a Graph
The AsyncHow Benchmark for Planning
Quality Check
Benchmarking Experiment
Experimental Setting and Design
Experiment Results
Further Analysis of GPT-3.5/4 Results
Accuracy vs. Complexity
Ablation Study
Out-of-distribution Probing
Qualitative Study
Wrong answers in easy problems.
...and 35 more sections

Figures (15)

Figure 1: A planning task (top) can be executed sequentially, in parallel, or asynchronously. Blue arrows denote action ordering constraints. Although complete parallelism is logically the most time-efficient strategy, it results in invalid reasoning steps (e.g. 'Baking' cannot happen at the same time as 'Rolling the dough'); conversely, executing every action sequentially hurts efficiency. Given infinite resources, an optimal (asynchronous) plan should parallelize actions wherever possible.
Figure 2: Comparing standard Input-Output (IO) prompting with our method (PLaG). Here, we illustrate PLaG (explicit graph) with an adjacency list, but it can be of any graph type in practice. In this paper, the standard IO method is likewise deployed in the zero-shot, zero-shot + CoT, k-shot, and k-shot + CoT settings. Please refer to Appendix \ref{sec:prompt-bench} for more details.
Figure 3: GPT-3.5 and GPT-4 accuracy as a function of asynchronous planning task complexity $|V|+|E|$ (see Section \ref{sec:formalism}), with results binned at a width of 2. The upper panel plots the performance of methods without PLaG (our method), and the lower panel displays the best method with and without PLaG.
Figure 4: The series-parallel DAG used to solve the planning task in Figure \ref{fig:task_desc}. The path for calculating optimal time duration is highlighted in red.
Figure 5: Overview of the AsyncHow benchmark. The three bar charts on the left display the instance numbers for the shortest/longest sequential path length and $|V|+|E|$ in different plans. The pie chart on the right shows the topic distribution in our dataset. See Appendix \ref{sec:topic-assignment} for details about the topic assignment.
...and 10 more figures
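
The captions for Figures 2 and 3 mention an adjacency-list graph representation and the complexity measure $|V|+|E|$ binned at a width of 2. The sketch below shows one way these could be computed from a list of ordering constraints. It is an illustration under assumptions, not the paper's prompt format: the helper names, the `->` text layout, and the bin alignment (lower edges at multiples of the width) are all hypothetical.

```python
def to_adjacency_list(edges):
    """Build step -> list of successor steps from (before, after) pairs."""
    adjacency = {}
    for before, after in edges:
        adjacency.setdefault(before, []).append(after)
        adjacency.setdefault(after, [])  # make sure sink steps appear as keys
    return adjacency


def complexity(adjacency):
    """|V| + |E|: number of steps plus number of ordering constraints."""
    return len(adjacency) + sum(len(nxt) for nxt in adjacency.values())


def render(adjacency):
    """Hypothetical plain-text layout for an adjacency list."""
    lines = []
    for step, successors in adjacency.items():
        targets = ", ".join(successors) if successors else "(none)"
        lines.append(f"{step} -> {targets}")
    return "\n".join(lines)


def bin_lower_edge(value, width=2):
    """Lower edge of the bin containing value (assumed alignment at 0)."""
    return (value // width) * width


# Illustrative constraints, not taken from the benchmark.
edges = [
    ("make dough", "roll dough"),
    ("roll dough", "bake"),
    ("preheat oven", "bake"),
]
adjacency = to_adjacency_list(edges)
print(render(adjacency))
print(complexity(adjacency), bin_lower_edge(complexity(adjacency)))
```

Steps with no ordering constraint at all never appear in `edges`, so a fuller version would also accept an explicit list of steps.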
