MALMM: Multi-Agent Large Language Models for Zero-Shot Robotics Manipulation

Harsh Singh, Rocktim Jyoti Das, Mingfei Han, Preslav Nakov, Ivan Laptev

TL;DR

The paper addresses the brittleness of single-agent LLM planning in robotics by introducing MALMM, a multi-agent LLM framework with specialized Planner, Coder, and Supervisor roles that leverage environment feedback after each step to enable adaptive re-planning. MALMM demonstrates strong zero-shot generalization across nine RLBench tasks and real-world trials, outperforming state-of-the-art baselines and showing robustness to intermediate failures and long-horizon planning. Key contributions include the first multi-agent LLM framework for robotic manipulation, a zero-shot prompting strategy without in-context examples, and comprehensive ablations revealing the benefits of role specialization and dynamic supervision. The work advances practical zero-shot manipulation by mitigating hallucinations and enabling adaptive execution, with implications for scalable, language-guided robotics.

Abstract

Large Language Models (LLMs) have demonstrated remarkable planning abilities across various domains, including robotics manipulation and navigation. While recent efforts in robotics have leveraged LLMs for both high-level and low-level planning, these approaches often face significant challenges, such as hallucinations in long-horizon tasks and limited adaptability due to the generation of plans in a single pass without real-time feedback. To address these limitations, we propose a novel multi-agent LLM framework, Multi-Agent Large Language Model for Manipulation (MALMM), which distributes high-level planning and low-level control code generation across specialized LLM agents, supervised by an additional agent that dynamically manages transitions. By incorporating observations from the environment after each step, our framework effectively handles intermediate failures and enables adaptive re-planning. Unlike existing methods, our approach does not rely on pre-trained skill policies or in-context learning examples and generalizes to a variety of new tasks. We evaluate our approach on nine RLBench tasks, including long-horizon tasks, and demonstrate its ability to solve robotics manipulation in a zero-shot setting, thereby overcoming key limitations of existing LLM-based manipulation methods.
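
The control flow described in the abstract (plan a step, generate code for it, execute, observe, re-plan) can be illustrated with a toy sketch. This is not the authors' implementation: the agents here are plain Python functions rather than LLM calls, the Supervisor is reduced to a fixed routing table instead of a dynamic LLM decision, and every name (`ToyEnv`, `planner`, `coder`, `supervisor`, `run`) is hypothetical. The environment simulates one grasp failure so that re-planning from the latest observation is visible.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyEnv:
    """Hypothetical stand-in for the simulator: counts blocks stacked at a target."""
    stacked: int = 0
    goal: int = 3
    fail_once_at: int = 1      # the attempt made when this many blocks are stacked fails once
    already_failed: bool = False

    def observe(self) -> dict:
        return {"stacked": self.stacked, "goal": self.goal}

    def execute(self, code: str) -> None:
        # Stand-in "code executor": the generated code is just a command string.
        if code == "stack_next_block":
            if self.stacked == self.fail_once_at and not self.already_failed:
                self.already_failed = True   # simulated intermediate failure, state unchanged
                return
            self.stacked += 1


def planner(obs: dict) -> str:
    # Plans the next step from the latest observation, so a failed step is simply retried.
    return "DONE" if obs["stacked"] >= obs["goal"] else "stack the next block"


def coder(step: str) -> str:
    # Maps a high-level step to low-level "code" (assumed one-to-one lookup for this toy).
    return {"stack the next block": "stack_next_block"}[step]


def supervisor(last_role: str) -> str:
    # Assumed fixed routing; the paper describes a Supervisor that manages transitions dynamically.
    return {"planner": "coder", "coder": "executor", "executor": "planner"}[last_role]


def run(env: ToyEnv, max_turns: int = 30) -> List[Tuple[str, str]]:
    log: List[Tuple[str, str]] = []
    role, step, code = "planner", "", ""
    for _ in range(max_turns):
        if role == "planner":
            step = planner(env.observe())
            log.append(("planner", step))
            if step == "DONE":
                break
        elif role == "coder":
            code = coder(step)
            log.append(("coder", code))
        else:  # executor
            env.execute(code)
            log.append(("executor", str(env.observe()["stacked"])))
        role = supervisor(role)
    return log


if __name__ == "__main__":
    env = ToyEnv()
    for entry in run(env):
        print(entry)
```

Because the planner reads a fresh observation on every turn, the simulated failure costs one extra plan-code-execute cycle instead of corrupting the rest of the plan; a single-pass planner would have no such signal.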

Paper Structure

This paper contains 29 sections, 13 figures, and 5 tables.

Figures (13)

  • Figure 1: Examples of executing the "Stack four blocks at the green target area" task with the Single Agent LLM (top) and our Multi-Agent MALMM framework (bottom). MALMM recovers after dropping one block and continues stacking above the target area, while the Single Agent mistakenly continues stacking blocks on top of the dropped block.
  • Figure 2: An overview of our multi-agent system, MALMM, which consists of three LLM agents—Planner, Coder, and Supervisor—and a Code executor tool. Each agent operates with a specific system prompt defining its role: (1) the Planner generates high-level plans and replans in case of intermediate failures, (2) the Coder converts these plans into low-level executable code, and (3) the Supervisor coordinates the system by managing the transitions between the Planner, the Coder, and the Code executor.
  • Figure 3: Agents for robotic manipulation: The figure illustrates three LLM-based manipulation frameworks, SA, MA, and MALMM, which differ in the number of agents they use. All three frameworks begin by receiving an input command and the initial environment observation. Each framework iteratively generates a high-level plan along with the corresponding low-level code. After each intermediate step, the frameworks use the updated environment observation to detect failures and replan as needed until the task is completed.
  • Figure 4: Illustration of the nine RLBench (James et al., 2019) tasks used in our evaluation, featuring diverse tasks with varying task horizons and different object shapes.
  • Figure 5: Comparison of Single Agent vs. MALMM for variations of the stack blocks task that require stacking 2, 3, or 4 blocks on top of each other.
  • ...and 8 more figures