ACPBench: Reasoning about Action, Change, and Planning

Harsha Kokel, Michael Katz, Kavitha Srinivas, Shirin Sohrabi

TL;DR

ACPBench is a benchmark for evaluating reasoning tasks in the field of planning. It consists of 7 reasoning tasks over 13 planning domains and is constructed from planning domains described in a formal language.

Abstract

There is an increasing body of work using Large Language Models (LLMs) as agents for orchestrating workflows and making decisions in domains that require planning and multi-step reasoning. As a result, it is imperative to evaluate LLMs on core skills required for planning. In this work, we present ACPBench, a benchmark for evaluating the reasoning tasks in the field of planning. The benchmark consists of 7 reasoning tasks over 13 planning domains. The collection is constructed from planning domains described in a formal language. This allows us to synthesize problems with provably correct solutions across many tasks and domains. Further, it allows us the luxury of scale without additional human effort, i.e., many additional problems can be created automatically. Our extensive evaluation of 22 LLMs and OpenAI o1 reasoning models highlights the significant gap in the reasoning capability of the LLMs. Our findings with OpenAI o1, a multi-turn reasoning model, reveal significant gains in performance on multiple-choice questions, yet surprisingly, no notable progress is made on boolean questions. The ACPBench collection is available at https://ibm.github.io/ACPBench.

Paper Structure (27 sections, 9 figures, 12 tables)

Figures (9)

  • Figure 1: Performance of a few state-of-the-art LLMs and OpenAI o1 reasoning models over different tasks in ACPBench. While the largest LLMs achieve more than $80\%$ accuracy on a few tasks, the variance in performance across tasks and across LLMs remains large. This indicates that there is a long way to go before these models can be reliably used in practical scenarios.
  • Figure 2: Example of boolean and multi-choice questions from the Applicability task in ACPBench. The context contains the domain and the problem description. The query to the LLM consists of the context and a boolean or multi-choice question.
  • Figure 3: Example of the COT prompt.
  • Figure 4: Comparison of the $8$ top-performing LLMs on multi-choice questions in the $13$ domains of ACPBench. The mean performance across the top-$8$ models is shown as a dotted black line. The mean line indicates that none of the domains is exceptionally easy.
  • Figure 5: Comparison of different prompt styles on two pretrained models, Granite 8B and LLAMA-3 70B, and a finetuned Granite 8B model, for MCQ tasks in $5$ testing domains.
  • ...and 4 more figures