MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation

Iman Ahmadi, Mehrshad Taji, Arad Mahdinezhad Kashani, AmirHossein Jadidi, Saina Kashani, Babak Khalaj

TL;DR

Experiments in simulation and real-world settings show that iterative, closed-loop multi-agent coordination improves generalization and increases success rates in zero-shot manipulation tasks.

Abstract

Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine-tuning, or prompt tuning, and often operate in an open-loop manner without robust environmental feedback, making them fragile in dynamic settings. MALLVi is a Multi-Agent Large Language and Vision framework that enables closed-loop, feedback-driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVi generates executable atomic actions for a robot manipulator. After action execution, a Vision-Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step. Rather than using a single model, MALLVi coordinates specialized agents (Decomposer, Localizer, Thinker, and Reflector) to manage perception, localization, reasoning, and high-level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full replanning. Experiments in simulation and real-world settings show that iterative, closed-loop multi-agent coordination improves generalization and increases success rates in zero-shot manipulation tasks. Code is available at https://github.com/iman1234ahmadi/MALLVI.
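
The closed loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`StepState`, `Verdict`, `STAGES`, `run_closed_loop`, and the stub agents in `demo`) is hypothetical, the real agents are LLM/VLM calls rather than plain functions, and the optional Descriptor is omitted. The sketch only shows the control flow: decompose the instruction into atomic steps, run the per-step agents, act, let a reflector judge the result, and on failure reactivate only the agents from the reported stage onward instead of replanning everything.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class StepState:
    """Intermediate results for one atomic step (fields are illustrative)."""
    step: str
    location: Optional[str] = None
    action: Optional[str] = None


@dataclass
class Verdict:
    """Reflector output: success flag plus the stage to reactivate on failure."""
    success: bool
    retry_from: Optional[str] = None


# Assumed per-step agent order; the paper's actual pipeline has more stages.
STAGES = ["localizer", "thinker"]


def run_closed_loop(
    instruction: str,
    decomposer: Callable[[str], List[str]],
    agents: Dict[str, Callable[[StepState], None]],
    act: Callable[[StepState], None],
    reflector: Callable[[StepState], Verdict],
    max_attempts: int = 3,
) -> List[str]:
    """Run each atomic step until the reflector accepts it; return a call log."""
    log: List[str] = []
    for step in decomposer(instruction):
        state = StepState(step)
        start = 0  # first attempt runs every stage
        for _ in range(max_attempts):
            for name in STAGES[start:]:
                agents[name](state)
                log.append(f"{name}:{step}")
            act(state)
            log.append(f"act:{step}")
            verdict = reflector(state)
            if verdict.success:
                break
            # Targeted recovery: rerun only from the stage the reflector names.
            start = STAGES.index(verdict.retry_from) if verdict.retry_from in STAGES else 0
        else:
            raise RuntimeError(f"step failed after {max_attempts} attempts: {step}")
    return log


def demo() -> List[str]:
    """Stub agents: the 'place' step fails once and is retried from the thinker."""
    failures = {"place on blue block": 1}

    def decomposer(instr: str) -> List[str]:
        return [s.strip() for s in instr.split(" then ")]

    def localizer(state: StepState) -> None:
        state.location = f"loc({state.step})"

    def thinker(state: StepState) -> None:
        state.action = f"move_to {state.location}"

    def act(state: StepState) -> None:
        pass  # a real system would send state.action to the manipulator

    def reflector(state: StepState) -> Verdict:
        if failures.get(state.step, 0) > 0:
            failures[state.step] -= 1
            return Verdict(False, retry_from="thinker")
        return Verdict(True)

    agents = {"localizer": localizer, "thinker": thinker}
    return run_closed_loop(
        "pick red block then place on blue block",
        decomposer, agents, act, reflector,
    )
```

In the `demo` run, the second step is localized once but reasoned about and executed twice, which is the point of targeted recovery: the stub reflector blames the thinker stage, so the localizer is not called again.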

Paper Structure (32 sections, 5 equations, 13 figures, 6 tables)

Figures (13)

  • Figure 1: The MALLVi framework architecture. The pipeline processes user prompts through specialized agents: Decompose breaks instructions into atomic steps, Describe provides scene understanding, Perceive processes visual inputs, Ground localizes target objects, Project generates motion trajectories, Think coordinates high-level reasoning, Act executes robotic commands, and Reflect evaluates outcomes to enable iterative refinement and error recovery.
  • Figure 2: Comparison between single-agent and multi-agent frameworks.
  • Figure 3: Analysis of specialized agents and their roles in a multi-agent system. Each agent functions at a designated level (high, mid, or low) to address specific components of task execution, including instruction decomposition, memory utilization, object localization, task reasoning, action execution, and closed-loop feedback provision.
  • Figure 4: Example of our real-world tasks. Stack Blocks, Sort Shape, and Math Operation each combine a specific prompt with a physical environment to assess an agent’s ability to act and solve problems in tangible settings.
  • Figure 5: A real-world example of the Stack Blocks task. MALLVi is asked to stack the blocks in the order red, blue and green. The wooden block acts as a distraction.
  • ...and 8 more figures