ParaCook: On Time-Efficient Planning for Multi-Agent Systems

Shiqi Zhang, Xinbei Ma, Yunqing Xu, Zouying Cao, Pengrui Lu, Haobo Yuan, Tiancheng Shen, Zhuosheng Zhang, Hai Zhao, Ming-Hsuan Yang

TL;DR

ParaCook addresses the problem of time-efficient planning in multi-agent systems by introducing a scalable, Overcooked-inspired benchmark that evaluates both intra- and inter-agent parallelism. It models tasks as directed acyclic graphs with timing and dependency constraints, and it uses metrics such as $SR$ and $OCT$ (plus penalized/normalized variants) to quantify correctness and efficiency, supplemented by $MD$ and $AU$ for coordination insights. The study benchmarks state-of-the-art LLM planners (e.g., GPT-5, Claude-Opus, Gemini, DeepSeek, Qwen) and humans, finding that GPT-5 achieves the best overall performance but still lags human coordination on complex tasks; Chain-of-Thought prompting helps mainly for strong models and can destabilize weaker ones. Abstract planning tasks reveal near-optimal scheduling for several LLMs, underscoring strong high-level reasoning but exposing a gap between abstract planning and embodied execution, motivating hierarchical planning as a bridge. ParaCook provides a scalable framework for evaluating time-efficient, coordination-aware planning and offers a public codebase to foster further development in time-efficient multi-agent planning.
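To make the DAG-with-timing view concrete, the sketch below computes the earliest finish time of every subtask under the classic critical-path method. It assumes that each subtask has a fixed duration, that agents are unlimited, and that movement cost is ignored. Under those assumptions, the longest dependency chain is a lower bound on completion time for any number of agents. The task names, durations, and the helpers `earliest_finish_times` and `serial_time` are hypothetical illustrations and are not taken from ParaCook's code or data.

```python
from graphlib import TopologicalSorter

# Hypothetical cooking subtasks: name -> (duration in time steps, prerequisites).
TASKS = {
    "fetch_lettuce": (2, []),
    "chop_lettuce": (3, ["fetch_lettuce"]),
    "fetch_beef": (2, []),
    "cook_beef": (5, ["fetch_beef"]),
    "fetch_bread": (1, []),
    "assemble_burger": (2, ["chop_lettuce", "cook_beef", "fetch_bread"]),
    "serve": (1, ["assemble_burger"]),
}

def earliest_finish_times(tasks):
    """Earliest finish time of each task, assuming unlimited agents (critical-path method)."""
    # graphlib expects a mapping from each node to its predecessors.
    graph = {name: set(deps) for name, (_, deps) in tasks.items()}
    finish = {}
    for name in TopologicalSorter(graph).static_order():
        duration, deps = tasks[name]
        # A task can start only after all of its prerequisites have finished.
        start = max((finish[d] for d in deps), default=0)
        finish[name] = start + duration
    return finish

def serial_time(tasks):
    """Completion time if one agent performs every subtask back to back."""
    return sum(duration for duration, _ in tasks.values())

finish = earliest_finish_times(TASKS)
print("critical-path lower bound:", max(finish.values()))  # 10
print("single-agent serial time:", serial_time(TASKS))     # 16
```

The gap between the serial time (16) and the critical-path bound (10) shows the headroom that parallel planning can exploit in this toy instance.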

Abstract

Large Language Models (LLMs) exhibit strong reasoning abilities for planning long-horizon, real-world tasks, yet existing agent benchmarks focus on task completion while neglecting time efficiency in parallel and asynchronous operations. To address this, we present ParaCook, a benchmark for time-efficient collaborative planning. Inspired by the Overcooked game, ParaCook provides an environment for a variety of challenging multi-agent interaction-planning problems instantiated as cooking tasks, with a simplified action space that isolates the core challenge of strategic parallel planning. Through a comprehensive evaluation of state-of-the-art LLMs, we find that current approaches produce suboptimal plans that struggle with parallel actions or coordination. Our analysis also reveals LLMs' potential on abstract tasks, where they can focus on high-level parallel optimization. ParaCook provides a scalable evaluation framework with adjustable complexity, establishing a foundation for developing and assessing time-efficiency-aware multi-agent planning. The code and data are available at https://github.com/zsq259/ParaCook.
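The sketch below shows why asynchronous operations reward strategic ordering. A greedy list scheduler visits subtasks in a given order and starts each one as early as possible. "Passive" subtasks, such as a pot cooking, run without holding an agent, so an agent that starts them early can do other work in the meantime. This is a simplified model chosen for illustration, not ParaCook's environment. It assumes fixed durations and no movement or location constraints. The task set and the helper `list_schedule` are hypothetical.

```python
# Hypothetical subtasks: name -> (duration, prerequisites, passive).
# A passive task runs on its own once its prerequisites are done;
# an active task occupies one agent for its whole duration.
TASKS = {
    "fetch_lettuce": (2, [], False),
    "chop_lettuce": (3, ["fetch_lettuce"], False),
    "fetch_beef": (2, [], False),
    "cook_beef": (5, ["fetch_beef"], True),
    "fetch_bread": (1, [], False),
    "assemble_burger": (2, ["chop_lettuce", "cook_beef", "fetch_bread"], False),
    "serve": (1, ["assemble_burger"], False),
}

def list_schedule(tasks, order, num_agents):
    """Greedy list schedule: visit tasks in `order` and start each as early as possible.

    Returns the makespan, i.e. the time at which the last task finishes."""
    agent_free = [0] * num_agents  # time at which each agent becomes idle
    finish = {}
    for name in order:
        duration, deps, passive = tasks[name]
        missing = [d for d in deps if d not in finish]
        if missing:
            raise ValueError(f"{name} scheduled before prerequisites {missing}")
        ready = max((finish[d] for d in deps), default=0)
        if passive:
            start = ready  # no agent is held while a passive task runs
        else:
            # Assign the agent that becomes idle first (ties go to the lowest index).
            agent = min(range(num_agents), key=lambda i: agent_free[i])
            start = max(ready, agent_free[agent])
            agent_free[agent] = start + duration
        finish[name] = start + duration
    return max(finish.values())

naive_order = ["fetch_lettuce", "chop_lettuce", "fetch_beef", "cook_beef",
               "fetch_bread", "assemble_burger", "serve"]
beef_first = ["fetch_beef", "cook_beef", "fetch_lettuce", "chop_lettuce",
              "fetch_bread", "assemble_burger", "serve"]

print(list_schedule(TASKS, naive_order, 1))  # 15
print(list_schedule(TASKS, beef_first, 1))   # 11: chopping overlaps with cooking
print(list_schedule(TASKS, beef_first, 2))   # 10: a second agent adds inter-agent parallelism
```

Starting the long passive step first lets a single agent overlap its own work with the cooking (intra-agent parallelism). A second agent shortens the plan further (inter-agent parallelism). Greedy list scheduling is only a heuristic and is not guaranteed to be optimal in general.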

Paper Structure

This paper contains 71 sections, 8 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Overview of the ParaCook benchmark, showing (a) the benchmark pipeline, (b) comparison of planning strategies, and (c) evaluation metrics for model performance.
  • Figure 2: Results on the human-evaluated subset for LLM-human comparison. Detailed scores are in the tables labeled tab:human_subset_results and tab:human_subset_results_normalized_per_difficulty.
  • Figure 3: Model performance across different task complexities. Top row (a1-a5): varying number of agents. Bottom row (b1-b5): varying number of orders. Each column corresponds to the five metrics.