ManiAgent: An Agentic Framework for General Robotic Manipulation

Yi Yang, Kefan Gu, Yuqing Wen, Hebei Li, Yucheng Zhao, Tiancai Wang, Xudong Liu

TL;DR

ManiAgent introduces a training-free, agentic framework for general robotic manipulation that decomposes tasks into perception, reasoning, and control via specialized agents, enabling end-to-end executable action generation from task descriptions and environmental inputs. A scene-perception module using a Vision-Language Model, a reasoning-and-planning module using an LLM, and an object-perception module with open-vocabulary detection collaborate to produce action sequences, with a caching mechanism to accelerate repeated subtasks. Empirical results show ManiAgent achieves 86.8% average success in SimplerEnv simulations and up to 95.8% average success in real-world tasks with strong VLMs, while also serving as an automated data-collection tool to train VLA systems with performance comparable to human-annotated data. The framework reduces data requirements, improves generalization for long-horizon manipulation, and offers a flexible, end-to-end approach that can extend to automated data generation and broader robotic platforms.

Abstract

While Vision-Language-Action (VLA) models have demonstrated impressive capabilities in robotic manipulation, their performance in complex reasoning and long-horizon task planning is limited by data scarcity and model capacity. To address this, we introduce ManiAgent, an agentic architecture for general manipulation tasks that maps task descriptions and environmental inputs end-to-end to robotic manipulation actions. In this framework, multiple agents communicate with one another to perform environmental perception, sub-task decomposition, and action generation, enabling efficient handling of complex manipulation scenarios. Evaluations show that ManiAgent achieves an 86.8% success rate on the SimplerEnv benchmark and 95.8% on real-world pick-and-place tasks. It also enables efficient data collection, yielding VLA models with performance comparable to those trained on human-annotated datasets. The project webpage is available at https://yi-yang929.github.io/ManiAgent/.

Paper Structure

This paper contains 17 sections, 3 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: ManiAgent decomposes the Menemen (a pepper-and-egg dish) ingredient-finding task into perception, reasoning, and execution handled by dedicated agents.
  • Figure 2: Overview of the ManiAgent framework. 1) The process begins with the perception agent, which takes scene images and user-provided instructions as input, and invokes a Vision-Language Model (VLM) to generate task-relevant scene descriptions. 2) The reasoning agent receives the scene descriptions and task instructions, then queries a Large Language Model (LLM) for status evaluation. 3) During sub-task execution, the perception agent uses object detection methods to identify target objects and retrieve detailed information. 4) The controller agent queries the cache based on the sub-task. If a matching cached action sequence is found, it is directly invoked; otherwise, the agent queries the LLM with the sub-task description and object details to generate a complete action sequence for execution.
  • Figure 3: The perception module of ManiAgent processes the target object list from the upper-level module with scene images, depth maps, and camera parameters to obtain object coordinates and grasping poses (using VLM for screening identical objects when needed), and finally sends text-format object information to the next module.
  • Figure 4: Task execution process in the SimplerEnv simulation environment
  • Figure 5: Definition and scenario examples of real-world manipulation tasks
  • ...and 1 more figures
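
The cache lookup described in step 4 of the Figure 2 caption can be illustrated with a minimal Python sketch. The caption states only that the controller agent queries the cache based on the sub-task and falls back to the LLM on a miss. Everything else here is an assumption made for illustration: the `CachedController` class, the `normalize` key function, the string-based action format, and the `fake_llm` stand-in are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical action representation; the source does not specify one.
Action = str


def normalize(sub_task: str) -> str:
    """Hypothetical cache key: lowercased sub-task text with collapsed whitespace."""
    return " ".join(sub_task.lower().split())


@dataclass
class CachedController:
    # Stand-in for the LLM query: (sub_task, object_info) -> action sequence.
    generate: Callable[[str, str], List[Action]]
    cache: Dict[str, List[Action]] = field(default_factory=dict)
    llm_calls: int = 0

    def plan(self, sub_task: str, object_info: str) -> List[Action]:
        key = normalize(sub_task)
        if key in self.cache:
            # Cache hit: reuse the stored sequence without querying the LLM.
            return list(self.cache[key])
        # Cache miss: query the (stand-in) LLM and store the result.
        self.llm_calls += 1
        actions = self.generate(sub_task, object_info)
        self.cache[key] = list(actions)
        return actions


def fake_llm(sub_task: str, object_info: str) -> List[Action]:
    """Placeholder generator used only to exercise the cache logic."""
    return [f"move_to({object_info})", "close_gripper()", "lift()"]


controller = CachedController(generate=fake_llm)
first = controller.plan("Pick up the pepper", "pepper@(0.42, 0.10, 0.05)")
second = controller.plan("pick up  the pepper", "pepper@(0.42, 0.10, 0.05)")
```

In this sketch the second call is served from the cache, so `controller.llm_calls` stays at 1. The caption does not say how a cached sequence is bound to fresh object details, so the sketch simply returns the stored sequence unchanged; a real system would presumably need to re-parameterize it with the current object information.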