Graph-enhanced Large Language Models in Asynchronous Plan Reasoning

Fangru Lin, Emanuele La Malfa, Valentin Hofmann, Elle Michelle Yang, Anthony Cohn, Janet B. Pierrehumbert

TL;DR

The paper addresses asynchronous planning by formalizing it as a longest-path problem on a DAG and introducing a dedicated benchmark, AsyncHow. It proposes Plan Like a Graph (PLaG), a graph-enhanced prompting technique that improves LLM performance across models and task complexities, achieving state-of-the-art results but revealing persistent degradation as complexity grows. The study provides a formal complexity measure that predicts performance, conducts extensive cross-model experiments, and offers rich analyses (ablations, out-of-distribution probes, qualitative cases) to understand the limits of LLMs as autonomous planning devices. Overall, while PLaG boosts capabilities, the results highlight fundamental scalability limits and motivate further exploration of graph-informed representations and multimodal data for robust autonomous planning.
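The longest-path formulation mentioned above can be illustrated with a minimal sketch. It assumes a node-weighted encoding (each step carries a duration, and each ordering constraint is an edge from a prerequisite to the step that depends on it); the paper's own graph encoding may differ. The function name `optimal_duration`, the step names, and the durations are hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def optimal_duration(durations, prerequisites):
    """Return the shortest total time for a plan with unlimited parallel resources.

    durations:     dict mapping each step to its time cost
    prerequisites: dict mapping a step to the steps that must finish before it starts
    The answer equals the longest (critical) path through the DAG.
    """
    graph = {step: prerequisites.get(step, ()) for step in durations}
    finish = {}
    # static_order() yields every step after all of its prerequisites.
    for step in TopologicalSorter(graph).static_order():
        start = max((finish[p] for p in graph[step]), default=0)
        finish[step] = start + durations[step]
    return max(finish.values(), default=0)


# Hypothetical baking plan (durations in minutes, invented for this example).
durations = {"mix dough": 10, "preheat oven": 10, "roll dough": 5, "bake": 20}
prerequisites = {
    "roll dough": ["mix dough"],
    "bake": ["roll dough", "preheat oven"],
}

print(optimal_duration(durations, prerequisites))  # 35
print(sum(durations.values()))                     # 45 if every step runs sequentially
```

In this toy plan the critical path is mix dough, roll dough, bake (10 + 5 + 20 = 35 minutes), while preheating runs in parallel. Running everything sequentially would take 45 minutes.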

Abstract

Planning is a fundamental property of human intelligence. Reasoning about asynchronous plans is challenging since it requires sequential and parallel planning to optimize time costs. Can large language models (LLMs) succeed at this task? Here, we present the first large-scale study investigating this question. We find that a representative set of closed and open-source LLMs, including GPT-4 and LLaMA-2, behave poorly when not supplied with illustrations about the task-solving process in our benchmark AsyncHow. We propose a novel technique called Plan Like a Graph (PLaG) that combines graphs with natural language prompts and achieves state-of-the-art results. We show that although PLaG can boost model performance, LLMs still suffer from drastic degradation when task complexity increases, highlighting the limits of utilizing LLMs for simulating digital devices. We see our study as an exciting step towards using LLMs as efficient autonomous agents. Our code and data are available at https://github.com/fangru-lin/graph-llm-asynchow-plan.

Paper Structure (50 sections, 1 equation, 15 figures, 8 tables)

This paper contains 50 sections, 1 equation, 15 figures, 8 tables.

Figures (15)

  • Figure 1: A planning task (top) can be executed sequentially, in parallel, or asynchronously. Blue arrows denote action ordering constraints. Although complete parallelism is logically the most time-efficient strategy, it results in invalid reasoning steps (e.g. 'Baking' cannot happen at the same time as 'Rolling the dough'), while executing every action sequentially hurts efficiency. Given infinite resources, an optimal (asynchronous) plan should parallelize actions wherever possible.
  • Figure 2: Comparing standard Input-Output (IO) prompting with our method (PLaG). Here, we illustrate PLaG (explicit graph) with an adjacency list, but it can be of any graph type in practice. The standard IO method is similarly deployed in zero-shot, zero-shot + CoT, k-shot, k-shot + CoT in this paper. Please refer to Appendix \ref{sec:prompt-bench} for more details.
  • Figure 3: GPT-3.5 and GPT-4 accuracy as a function of asynchronous planning task complexity $|V|+|E|$ (see Section \ref{sec:formalism}), after binning results by width of 2. The upper figure plots the performance of methods without PLaG (our method), and the lower plot displays the best method with/without PLaG.
  • Figure 4: The series-parallel DAG used to solve the planning task in Figure \ref{fig:task_desc}. The path for calculating optimal time duration is highlighted in red.
  • Figure 5: Overview of the AsyncHow benchmark. The three bar charts on the left display the instance numbers for the shortest/longest sequential path length and $|V|+|E|$ in different plans. The pie chart on the right shows the topic distribution in our dataset. See Appendix \ref{sec:topic-assignment} for details about the topic assignment.
  • ...and 10 more figures
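
The complexity measure $|V|+|E|$ and the width-2 binning described in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: the graph is assumed to be an adjacency list stored as a Python dict, bins are assumed to start at multiples of the bin width, and the helper names `complexity` and `accuracy_by_bin` as well as all data values are hypothetical.

```python
from collections import defaultdict


def complexity(adjacency):
    """Return |V| + |E| for a DAG given as {node: [successor, ...]}."""
    nodes = set(adjacency)
    for successors in adjacency.values():
        nodes.update(successors)
    # Successors are deduplicated so that a repeated entry is not counted twice.
    edges = sum(len(set(successors)) for successors in adjacency.values())
    return len(nodes) + edges


def accuracy_by_bin(results, width=2):
    """Group (complexity, is_correct) pairs into bins and return accuracy per bin.

    Each bin is keyed by its lower edge; aligning bins to multiples of `width`
    is an assumption made for this sketch.
    """
    totals = defaultdict(lambda: [0, 0])  # lower edge -> [correct, total]
    for value, is_correct in results:
        lower = (value // width) * width
        totals[lower][0] += int(is_correct)
        totals[lower][1] += 1
    return {lower: hits / n for lower, (hits, n) in sorted(totals.items())}


# Hypothetical plan graph: 4 nodes and 3 edges.
graph = {"mix": ["roll"], "roll": ["bake"], "preheat": ["bake"], "bake": []}
print(complexity(graph))  # 7

# Hypothetical per-instance outcomes: (complexity, model answered correctly).
outcomes = [(6, True), (7, False), (8, True), (9, True)]
print(accuracy_by_bin(outcomes))  # {6: 0.5, 8: 1.0}
```

Plotting the returned dictionary with bin edges on the x-axis and accuracy on the y-axis would give a curve of the same kind as the one the caption describes.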