
PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks

Matthew Chang, Gunjan Chhablani, Alexander Clegg, Mikael Dallaire Cote, Ruta Desai, Michal Hlavac, Vladimir Karashchuk, Jacob Krantz, Roozbeh Mottaghi, Priyam Parashar, Siddharth Patki, Ishita Prasad, Xavier Puig, Akshara Rai, Ram Ramrakhya, Daniel Tran, Joanne Truong, John M. Turner, Eric Undersander, Tsung-Yen Yang

TL;DR

PARTNR addresses the need for scalable benchmarks to study planning and reasoning in embodied human-robot collaboration on household tasks. It introduces a semi-automated, simulation-grounded pipeline that generates 100k natural-language tasks, each with a tailored evaluation function, across 60 houses, enabling analysis of planning, perception, and skill execution. The study compares LLM-based planners against human performance, revealing substantial coordination gaps in current models and showing that small, fine-tuned LLMs can reach parity with much larger models while offering faster inference, which matters most in human-in-the-loop (HITL) settings. HITL experiments show that human teams still outperform humans paired with LLM-guided partners, pointing to future work on grounding, coordination, and robust perception to close the gap toward practical collaborative agents.

Abstract

We present a benchmark for Planning And Reasoning Tasks in humaN-Robot collaboration (PARTNR) designed to study human-robot coordination in household activities. PARTNR tasks exhibit characteristics of everyday tasks, such as spatial, temporal, and heterogeneous agent capability constraints. We employ a semi-automated task generation pipeline using Large Language Models (LLMs), incorporating simulation in the loop for grounding and verification. PARTNR stands as the largest benchmark of its kind, comprising 100,000 natural language tasks, spanning 60 houses and 5,819 unique objects. We analyze state-of-the-art LLMs on PARTNR tasks, across the axes of planning, perception and skill execution. The analysis reveals significant limitations in SoTA models, such as poor coordination and failures in task tracking and recovery from errors. When LLMs are paired with real humans, they require 1.5x as many steps as two humans collaborating and 1.1x more steps than a single human, underscoring the potential for improvement in these models. We further show that fine-tuning smaller LLMs with planning data can achieve performance on par with models 9 times larger, while being 8.6x faster at inference. Overall, PARTNR highlights significant challenges facing collaborative embodied agents and aims to drive research in this direction.

Paper Structure

This paper contains 58 sections, 8 equations, 14 figures, 14 tables.

Figures (14)

  • Figure 1: We present PARTNR, a benchmark for planning and reasoning in embodied multi-agent tasks, featuring 100,000 everyday tasks and evaluation functions generated semi-automatically, spanning 60 houses and 5,819 unique objects. We analyze LLM-based planning agents and also provide a human-in-the-loop tool to evaluate how agents collaborate with real humans.
  • Figure 2: The PARTNR generation pipeline. Task and evaluation generators produce episodes, which are filtered and annotated for correctness. These episodes are then treated as seeds to achieve 100k-scale. Finally, episodes are vetted during human-in-the-loop collection.
  • Figure 3: Task and evaluation example. Language tasks have inherent complexity and ambiguity, both of which are supported by the structure of our evaluation functions.
  • Figure 4: Distribution of task types in PARTNR. The left plot displays the percentage of tasks with each characteristic. Constraint-free tasks by definition exclude the other types. The top right bars correspond to the dot combination below.
  • Figure 5: Decentralized architecture. The human and robot agents use a 2-layer hierarchical architecture, with high-level LLM planners that call low-level skills. Both agents build a world graph, updated using observations and actions.
  • ...and 9 more figures
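
The decentralized architecture described in the Figure 5 caption (each agent has a high-level planner that calls low-level skills, and each keeps its own world graph updated from observations and actions) can be illustrated with a small sketch. This is not the PARTNR implementation: every name below (WorldGraph, Agent, rule_based_planner, rearrange_skill, run_episode) is hypothetical, the planner is a simple rule standing in for an LLM, and the environment is a plain dictionary with full observability.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# All names here are hypothetical and chosen for illustration only.
Action = Tuple[str, str, str]  # (skill name, object, target location)


@dataclass
class WorldGraph:
    """Per-agent belief about the world: object name -> location."""
    locations: Dict[str, str] = field(default_factory=dict)

    def update(self, observations: Dict[str, str]) -> None:
        # Merge new observations into the agent's private belief.
        self.locations.update(observations)


def rule_based_planner(graph: WorldGraph, goal: Dict[str, str]) -> Optional[Action]:
    """High-level planner (stand-in for an LLM): pick the first object
    that the agent believes is not yet at its goal location."""
    for obj, target in goal.items():
        if graph.locations.get(obj) != target:
            return ("rearrange", obj, target)
    return None  # the agent believes the task is done


def rearrange_skill(env: Dict[str, str], obj: str, target: str) -> Dict[str, str]:
    """Low-level skill: move an object in the shared environment and
    return the resulting observation."""
    env[obj] = target
    return {obj: target}


@dataclass
class Agent:
    name: str
    planner: Callable[[WorldGraph, Dict[str, str]], Optional[Action]]
    graph: WorldGraph = field(default_factory=WorldGraph)

    def step(self, env: Dict[str, str], goal: Dict[str, str]) -> Optional[Action]:
        self.graph.update(env)                 # observe (assumed fully observable)
        action = self.planner(self.graph, goal)
        if action is None:
            return None
        _, obj, target = action
        self.graph.update(rearrange_skill(env, obj, target))  # act, then update belief
        return action


def run_episode(env: Dict[str, str], goal: Dict[str, str],
                agents: List[Agent], max_steps: int = 10) -> List[Tuple[str, Action]]:
    """Agents take turns; each plans independently from its own world graph."""
    log: List[Tuple[str, Action]] = []
    for _ in range(max_steps):
        acted = False
        for agent in agents:
            action = agent.step(env, goal)
            if action is not None:
                log.append((agent.name, action))
                acted = True
        if not acted:
            break
    return log
```

In this sketch the agents never exchange plans; they coordinate only through what they observe in the shared environment, which is the sense in which the setup is decentralized. Replacing rule_based_planner with a call to a language model would leave the rest of the loop unchanged.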