Multi-Robot Task Planning for Multi-Object Retrieval Tasks with Distributed On-Site Knowledge via Large Language Models

Kento Murata, Shoichi Hasegawa, Tomochika Ishikawa, Yoshinobu Hagiwara, Akira Taniguchi, Lotfi El Hafi, Tadahiro Taniguchi

TL;DR

This work tackles multi-robot task planning under distributed on-site knowledge by combining large language models with a spatial concept model that encodes room names and room-wise object presence probabilities. The proposed pipeline decomposes natural language instructions, allocates subtasks to robots based on environment-grounded knowledge, plans actions sequentially, and executes them with a closed-loop feedback mechanism via FlexBE. Realistic simulations and RoboCup @Home experiments show that grounding allocations in spatial concepts improves assignment accuracy (47/50, versus 28/50 and 26/50 for the baselines) and enables handling of ambiguous commands such as field-trip preparation. The approach is a step toward scalable, robust multi-robot coordination in household-like settings and points to future enhancements, including heterogeneous teams and dynamic reallocation, for more complex real-world deployments.

Abstract

It is crucial to efficiently execute instructions such as "Find an apple and a banana" or "Get ready for a field trip," which require searching for multiple objects or understanding context-dependent commands. This study addresses the challenging problem of determining which robot should be assigned to which part of a task when each robot possesses different situational on-site knowledge, specifically the spatial concepts learned from the area designated to it by the user. We propose a task planning framework that leverages large language models (LLMs) and spatial concepts to decompose natural language instructions into subtasks and allocate them to multiple robots. We designed a novel few-shot prompting strategy that enables LLMs to infer required objects from ambiguous commands and decompose them into appropriate subtasks. In our experiments, the proposed method achieved 47/50 successful assignments, outperforming random assignment (28/50) and commonsense-based assignment (26/50). Furthermore, we conducted qualitative evaluations using two actual mobile manipulators. The results demonstrated that our framework could handle instructions, including those involving ad hoc categories such as "Get ready for a field trip," by successfully performing task decomposition, assignment, sequential planning, and execution.
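
The allocation idea in the abstract, giving each subtask to the robot whose on-site knowledge best supports it, can be illustrated with a minimal sketch. This is not the authors' method: the paper performs allocation by prompting an LLM with spatial-concept knowledge, whereas the code below substitutes a plain argmax over object presence probabilities. The robot names, room names, probability values, and the functions `best_room` and `allocate` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical on-site knowledge: robot -> room -> object -> presence probability.
Knowledge = Dict[str, Dict[str, Dict[str, float]]]


def best_room(rooms: Dict[str, Dict[str, float]], obj: str) -> Tuple[str, float]:
    """Return the (room, probability) where one robot most expects `obj`."""
    return max(
        ((room, probs.get(obj, 0.0)) for room, probs in rooms.items()),
        key=lambda room_prob: room_prob[1],
    )


def allocate(objects: List[str], knowledge: Knowledge) -> Dict[str, Tuple[str, str, float]]:
    """Assign each object to the robot with the highest presence probability.

    Simplified stand-in for LLM-based allocation (an assumption of this sketch).
    """
    allocation = {}
    for obj in objects:
        robot, (room, prob) = max(
            ((name, best_room(rooms, obj)) for name, rooms in knowledge.items()),
            key=lambda item: item[1][1],
        )
        allocation[obj] = (robot, room, prob)
    return allocation


# Illustrative values only; they are not taken from the paper.
knowledge: Knowledge = {
    "robot_1f": {"Kitchen": {"apple": 0.7, "banana": 0.1}, "Living Room": {"apple": 0.1}},
    "robot_2f": {"Child's Room": {"banana": 0.6}, "Bathroom": {"apple": 0.05}},
}
plan = allocate(["apple", "banana"], knowledge)
for obj, (robot, room, prob) in plan.items():
    print(f"{obj}: {robot} searches {room} (p={prob})")
```

With these made-up values, the apple goes to the first-floor robot and the banana to the second-floor robot, so the two searches can run in parallel. A commonsense-only baseline would have no access to the per-robot probabilities and could send both objects to the kitchen.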

Paper Structure

This paper contains 26 sections, 2 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Overview of the research problem setting illustrated in four panels (a–d): (a) The environment is divided and learned by different robots. (b) A user gives a high-level natural language instruction. (c) Each robot searches for the requested objects based on its own knowledge. (d) The robots return with the objects, completing the instruction. For example, two mobile manipulators cooperatively accomplish the instruction "I want to make a fruit smoothie, so please find an apple and a banana" in parallel by leveraging their floor-specific knowledge. The performance of this framework is validated through real-world robot experiments described later.
  • Figure 2: Overview of the proposed pipeline: (a) Each robot organizes its on-site knowledge—room names, vocabulary, and object occurrence probabilities—using the spatial concept model; (b) GPT-4 uses this knowledge as prompt components to perform subtask decomposition and allocation; (c) action-planning prompts are generated accordingly; and (d) FlexBE executes the actions via a state machine and returns feedback.
  • Figure 3: Illustration of how the robot infers both the frequently observed place-related vocabulary and the object–location existence probabilities in each area based on sensor observations.
  • Figure 4: Execution–replanning loop: Skills generated by GPT-4 are executed as FlexBE state machines. Each execution result and observation (e.g., an undetected object, a blocked path, or a grasp failure) is immediately fed back to GPT-4, which replans accordingly. Since each robot operates its own loop independently, parallel task execution is possible based on prior allocation decisions.
  • Figure 5: Object placement in the evaluation environment created using Sweet Home 3D. The environment consists of 10 rooms (1F: Living Room, Dining Room, Entrance, Kitchen, Office Room; 2F: Bathroom, Child’s Room, Parent Room, Front of Stairs, Corridor). A total of 24 objects were placed. Objects labeled as "common-sense placed" were deemed appropriate for their rooms by GPT-4, while "hard-to-predict objects" require spatial knowledge for accurate localization.
  • ...and 3 more figures
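
The execution and replanning loop described in the Figure 4 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the planner and executor are stub functions standing in for GPT-4 and the FlexBE state machines, no real GPT-4 or FlexBE API is called, and the skill strings, the `max_replans` limit, and all function names are hypothetical.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, bool, str]]  # (skill, success, observation)


def run_with_replanning(
    goal: str,
    plan_fn: Callable[[str, History], List[str]],
    execute_fn: Callable[[str], Tuple[bool, str]],
    max_replans: int = 3,
) -> Tuple[bool, History]:
    """Execute skills one by one; on failure, feed the history back and replan."""
    history: History = []
    plan = list(plan_fn(goal, history))
    if not plan:
        return False, history
    replans = 0
    while plan:
        skill = plan.pop(0)
        ok, observation = execute_fn(skill)
        history.append((skill, ok, observation))
        if ok:
            continue
        if replans >= max_replans:
            return False, history
        replans += 1
        plan = list(plan_fn(goal, history))  # planner sees the failure
        if not plan:
            return False, history
    return True, history


def stub_planner(goal: str, history: History) -> List[str]:
    """Stand-in for the LLM planner: avoid rooms where navigation failed."""
    failed = {s.split(":")[1] for s, ok, _ in history
              if s.startswith("navigate:") and not ok}
    for room in ["Kitchen", "Dining Room"]:
        if room not in failed:
            return [f"navigate:{room}", f"detect:{goal}", f"grasp:{goal}"]
    return []


def stub_executor(skill: str) -> Tuple[bool, str]:
    """Stand-in for skill execution: the kitchen path is blocked."""
    if skill == "navigate:Kitchen":
        return False, "blocked path"
    return True, "ok"


success, log = run_with_replanning("apple", stub_planner, stub_executor)
for entry in log:
    print(entry)
```

Because each robot would run its own instance of such a loop, the loops stay independent once subtasks have been allocated, which is what allows parallel execution in the caption's description.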