UniManip: General-Purpose Zero-Shot Robotic Manipulation with Agentic Operational Graph

Haichao Liu, Yuanjiang Xue, Yuheng Zhou, Haoyuan Deng, Yinan Liang, Lihua Xie, Ziwei Wang

TL;DR

UniManip is a framework grounded in a Bi-level Agentic Operational Graph (AOG) that unifies semantic reasoning and physical grounding, enabling direct zero-shot transfer from fixed-base setups to mobile manipulation without fine-tuning or reconfiguration.

Abstract

Achieving general-purpose robotic manipulation requires robots to seamlessly bridge high-level semantic intent with low-level physical interaction in unstructured environments. However, existing approaches falter in zero-shot generalization: end-to-end Vision-Language-Action (VLA) models often lack the precision required for long-horizon tasks, while traditional hierarchical planners suffer from semantic rigidity when facing open-world variations. To address this, we present UniManip, a framework grounded in a Bi-level Agentic Operational Graph (AOG) that unifies semantic reasoning and physical grounding. By coupling a high-level Agentic Layer for task orchestration with a low-level Scene Layer for dynamic state representation, the system continuously aligns abstract planning with geometric constraints, enabling robust zero-shot execution. Unlike static pipelines, UniManip operates as a dynamic agentic loop: it actively instantiates object-centric scene graphs from unstructured perception, parameterizes these representations into collision-free trajectories via a safety-aware local planner, and exploits structured memory to autonomously diagnose and recover from execution failures. Extensive experiments validate the system's robust zero-shot capability on unseen objects and tasks, demonstrating a 22.5% and 25.0% higher success rate compared to state-of-the-art VLA and hierarchical baselines, respectively. Notably, the system enables direct zero-shot transfer from fixed-base setups to mobile manipulation without fine-tuning or reconfiguration. Our open-source project page can be found at https://henryhcliu.github.io/unimanip.
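The agentic loop described in the abstract (execute a step, record the outcome in structured memory, and retry after a failure) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `SceneGraph`, `Memory`, `run_loop`, `flaky_execute`, and the step names are hypothetical and do not come from the UniManip code base, and the retry policy is a simple stand-in for the paper's diagnosis and recovery mechanism.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SceneGraph:
    """Toy stand-in for the Scene Layer: object or step names mapped to states."""
    objects: Dict[str, str] = field(default_factory=dict)


@dataclass
class Memory:
    """Structured log of (step, attempt, success) records."""
    events: List[Tuple[str, int, bool]] = field(default_factory=list)


def run_loop(
    steps: List[str],
    execute: Callable[[str, SceneGraph, int], bool],
    scene: SceneGraph,
    max_retries: int = 2,
) -> Tuple[bool, Memory]:
    """Run each step in order, logging every attempt and retrying on failure."""
    memory = Memory()
    for step in steps:
        for attempt in range(max_retries + 1):
            ok = execute(step, scene, attempt)
            memory.events.append((step, attempt, ok))
            if ok:
                break
        else:
            # Every attempt at this step failed, so the task is aborted.
            return False, memory
    return True, memory


def flaky_execute(step: str, scene: SceneGraph, attempt: int) -> bool:
    """Hypothetical executor: the first grasp attempt fails, all else succeeds."""
    if step == "grasp cup" and attempt == 0:
        return False
    scene.objects[step] = "done"
    return True


success, memory = run_loop(
    ["open drawer", "grasp cup", "place cup"], flaky_execute, SceneGraph()
)
```

In this sketch the memory ends up with four records for three steps, because the failed grasp is logged before the retry succeeds. A real system would presumably use that record to choose a different recovery action instead of repeating the same step.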

Paper Structure (34 sections, 23 equations, 12 figures, 6 tables)

Figures (12)

  • Figure 1: UniManip achieves robust, general-purpose robotic manipulation in open-world settings. The system supports zero-shot transfer across diverse embodiments (fixed and mobile) and utilizes a graph-based agentic workflow to adapt to errors during long-horizon tasks, ensuring high success rates without reconfiguration.
  • Figure 2: Overview of the UniManip framework. The system integrates high-level task planning with low-level motion execution through an Agentic Operational Graph (AOG), illustrated at the agent level. The VLM interprets human commands to generate an operational graph, which guides the robot's actions. A reflective recovery mechanism allows the system to diagnose and adapt to execution failures.
  • Figure 3: The structure and workflow of the proposed bi-level agentic operational graph. The upper layer shows the AI agent with five nodes and conditional directed edges as the ALG. The lower layer shows the structured semantic understanding of the environment described by the SOSG.
  • Figure 4: Demonstration of the spatial operations of the robot, with an instance of opening a drawer. The task is decomposed into several tool invocations, and each tool has its specific spatial operational formats for the movement of the robotic manipulator.
  • Figure 5: Visualization of the conservative volumetric occupancy grid $\mathcal{M}_{final}$ generated from a single-view RGB-D observation. The gravity-aligned completion over-approximates unknown space, improving safety under occlusion.
  • ...and 7 more figures
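The caption of Figure 5 mentions a gravity-aligned completion that over-approximates unknown space. The paper's actual construction of $\mathcal{M}_{final}$ is not given in this excerpt, so the following is only one plausible reading, sketched under stated assumptions: the grid is a boolean NumPy array of shape (X, Y, Z), the last axis points up against gravity, and every voxel below an observed occupied voxel is conservatively treated as occupied. The function name `complete_below` is hypothetical.

```python
import numpy as np


def complete_below(occupied: np.ndarray) -> np.ndarray:
    """Mark every voxel at or below an observed occupied voxel as occupied.

    occupied: boolean array of shape (X, Y, Z); the Z index is assumed to
    increase upward, i.e. against gravity.
    """
    # Flip so that index 0 along the last axis is the top of each column.
    top_down = occupied[:, :, ::-1]
    # A running OR from the top downward: a voxel becomes occupied as soon
    # as any voxel at or above it in the same column is occupied.
    filled = np.logical_or.accumulate(top_down, axis=2)
    # Flip back to the original bottom-to-top ordering.
    return filled[:, :, ::-1]


# A single column of four voxels with one observed surface voxel at z = 2.
grid = np.zeros((1, 1, 4), dtype=bool)
grid[0, 0, 2] = True
completed = complete_below(grid)
```

The result never removes an observed voxel and only adds occupancy underneath observed surfaces, which is the sense in which such a completion is conservative: a planner that avoids the completed grid also avoids the occluded volume a single RGB-D view cannot see.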