PLANET: A Collection of Benchmarks for Evaluating LLMs' Planning Capabilities

Haoming Li, Zhaoliang Chen, Jonathan Zhang, Fei Liu

TL;DR

The paper addresses the challenge of evaluating LLM agents' planning capabilities across diverse domains by surveying a broad set of benchmarks organized into seven categories: embodied environments, web navigation, scheduling, games and puzzles, everyday task automation, text-based reasoning, and agentic benchmarks. It synthesizes key benchmarks, discusses their strengths and limitations, and provides guidance on selecting appropriate tests while identifying gaps such as dynamic world models, long-horizon planning under uncertainty, and multimodal grounding. The main contribution is a comprehensive taxonomy and critique that informs both benchmark usage and future development, with the aim of promoting more robust, generalizable planning in LLM-driven agents. The work thus serves as a practical roadmap for researchers and practitioners to benchmark, compare, and advance planning capabilities in real-world AI systems.

Abstract

Planning is central to agents and agentic AI. The ability to plan, e.g., creating travel itineraries within a budget, holds immense potential in both scientific and commercial contexts. Moreover, optimal plans tend to require fewer resources compared to ad-hoc methods. To date, a comprehensive understanding of existing planning benchmarks appears to be lacking. Without it, comparing planning algorithms' performance across domains or selecting suitable algorithms for new scenarios remains challenging. In this paper, we examine a range of planning benchmarks to identify commonly used testbeds for algorithm development and highlight potential gaps. These benchmarks are categorized into embodied environments, web navigation, scheduling, games and puzzles, and everyday task automation. Our study recommends the most appropriate benchmarks for various algorithms and offers insights to guide future benchmark development.

Paper Structure

This paper contains 10 sections and 5 figures.

Figures (5)

  • Figure 1: Sourced from ALFWorld (Shridhar et al., 2021), this example illustrates interactive alignment between text and embodied worlds.
  • Figure 2: Directly adapted from VisualWebArena (Koh et al., 2024), this example shows an agent's action trajectory to block the author of a target image post in /f/memes.
  • Figure 3: Adapted from Natural Plan (Zheng et al., 2024), this example illustrates meeting times and locations for a group of friends. The objective is to maximize the number of friends one can meet, considering constraints such as travel time between locations.
  • Figure 4: Adapted from Dualformer (Su et al., 2024), this example illustrates the maze navigation task, where the task (prompt) and the plan are both represented as token sequences.
  • Figure 5: Adapted from RAP (Hao et al., 2023), this figure illustrates plan generation in BlocksWorld (left), mathematical reasoning in GSM8K (center), and logical reasoning in PrOntoQA (right).