UniManip: General-Purpose Zero-Shot Robotic Manipulation with Agentic Operational Graph

Haichao Liu; Yuanjiang Xue; Yuheng Zhou; Haoyuan Deng; Yinan Liang; Lihua Xie; Ziwei Wang

TL;DR

UniManip is a framework grounded in a Bi-level Agentic Operational Graph (AOG) that unifies semantic reasoning and physical grounding, enabling direct zero-shot transfer from fixed-base setups to mobile manipulation without fine-tuning or reconfiguration.

Abstract

Achieving general-purpose robotic manipulation requires robots to seamlessly bridge high-level semantic intent with low-level physical interaction in unstructured environments. However, existing approaches falter in zero-shot generalization: end-to-end Vision-Language-Action (VLA) models often lack the precision required for long-horizon tasks, while traditional hierarchical planners suffer from semantic rigidity when facing open-world variations. To address this, we present UniManip, a framework grounded in a Bi-level Agentic Operational Graph (AOG) that unifies semantic reasoning and physical grounding. By coupling a high-level Agentic Layer for task orchestration with a low-level Scene Layer for dynamic state representation, the system continuously aligns abstract planning with geometric constraints, enabling robust zero-shot execution. Unlike static pipelines, UniManip operates as a dynamic agentic loop: it actively instantiates object-centric scene graphs from unstructured perception, parameterizes these representations into collision-free trajectories via a safety-aware local planner, and exploits structured memory to autonomously diagnose and recover from execution failures. Extensive experiments validate the system's robust zero-shot capability on unseen objects and tasks, demonstrating a 22.5% and 25.0% higher success rate compared to state-of-the-art VLA and hierarchical baselines, respectively. Notably, the system enables direct zero-shot transfer from fixed-base setups to mobile manipulation without fine-tuning or reconfiguration. Our open-source project page can be found at https://henryhcliu.github.io/unimanip.
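The dynamic agentic loop described in the abstract (perceive, act, record the outcome, retry on failure) can be illustrated with a minimal sketch. This is not the authors' implementation: `SceneGraph`, `Memory`, `run_task`, the `perceive` and `execute` callbacks, and the `max_retries` policy are all hypothetical names and choices introduced here for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SceneGraph:
    """Toy stand-in for an object-centric scene graph: object name -> location."""
    objects: Dict[str, str] = field(default_factory=dict)


@dataclass
class Memory:
    """Structured log of (step, outcome) pairs, consulted when a step fails."""
    records: List[Tuple[str, str]] = field(default_factory=list)

    def failures(self, step: str) -> int:
        # Count how many times this step has failed so far.
        return sum(1 for s, o in self.records if s == step and o == "fail")


def run_task(
    steps: List[str],
    perceive: Callable[[], SceneGraph],
    execute: Callable[[str, SceneGraph], bool],
    max_retries: int = 2,
) -> Tuple[bool, Memory]:
    """Run each step in order, re-perceiving the scene before every attempt.

    Assumed recovery policy: a step is retried until it has failed more than
    `max_retries` times, after which the whole task is reported as failed.
    """
    memory = Memory()
    for step in steps:
        while True:
            scene = perceive()          # rebuild the scene graph from perception
            ok = execute(step, scene)   # plan and run one step against that graph
            memory.records.append((step, "ok" if ok else "fail"))
            if ok:
                break
            if memory.failures(step) > max_retries:
                return False, memory
    return True, memory


if __name__ == "__main__":
    attempts = []

    def flaky_execute(step: str, scene: SceneGraph) -> bool:
        # Hypothetical executor: the very first attempt fails, later ones succeed.
        attempts.append(step)
        return len(attempts) > 1

    success, log = run_task(
        ["grasp cup", "place cup"],
        perceive=lambda: SceneGraph({"cup": "table"}),
        execute=flaky_execute,
    )
    print(success, log.records)
```

The point of the sketch is the control flow: the scene representation is refreshed before every attempt rather than once up front, and the failure history lives in an explicit data structure that the recovery policy can query.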

Paper Structure (34 sections, 23 equations, 12 figures, 6 tables)

Introduction
Related Work
VLA Models for Manipulation
Hierarchical Methods with Planning before Action
Manipulation Failure Detection and Autonomous Recovery
Embodied Agentic Structure for Robotic Manipulation
Problem Statement
Agentic AI Coordinated Workflow
Semantic-Operational State Graph (SOSG)
Agentic Task and Action Planning
Agentic Spatial Reasoning and Action Parameterization
Integrated Agentic Graph Execution for Robotic Manipulation
Graph-Grounded Scene Parsing and Task Decomposition
Task Decomposition
Dynamic SOSG Update and Scene Parsing
...and 19 more sections

Figures (12)

Figure 1: UniManip achieves robust, general-purpose robotic manipulation in open-world settings. The system supports zero-shot transfer across diverse embodiments (fixed and mobile) and utilizes a graph-based agentic workflow to adapt to errors during long-horizon tasks, ensuring high success rates without reconfiguration.
Figure 2: Overview of the UniManip framework. The system integrates high-level task planning with low-level motion execution through an Agentic Operational Graph (AOG), illustrated at the agent level. The VLM interprets human commands to generate an operational graph, which guides the robot's actions. A reflective recovery mechanism allows the system to diagnose and adapt to execution failures.
Figure 3: The structure and workflow of the proposed bi-level agentic operational graph. The upper layer shows the AI agent with five nodes and conditional directed edges as the ALG. The lower layer shows the structured semantic understanding of the environment described by the SOSG.
Figure 4: Demonstration of the spatial operations of the robot, with an instance of opening a drawer. The task is decomposed into several tool invocations, and each tool has its specific spatial operational formats for the movement of the robotic manipulator.
Figure 5: Visualization of the conservative volumetric occupancy grid $\mathcal{M}_{final}$ generated from a single-view RGB-D observation. The gravity-aligned completion over-approximates unknown space, improving safety under occlusion.
...and 7 more figures
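The conservative occupancy grid mentioned in the Figure 5 caption can be illustrated with a small sketch. The caption does not give the exact completion rule, so the code below assumes one plausible reading: gravity points along negative z, and every voxel beneath an observed surface voxel, down to an assumed floor level, is marked occupied. The names `voxelize`, `complete_downward`, and `floor_z` are hypothetical and do not come from the paper.

```python
import math
from typing import Dict, Iterable, Set, Tuple

Voxel = Tuple[int, int, int]
Point = Tuple[float, float, float]


def voxelize(points: Iterable[Point], voxel_size: float) -> Set[Voxel]:
    """Map 3D points (in meters) to integer voxel indices."""
    return {
        (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        for x, y, z in points
    }


def complete_downward(observed: Set[Voxel], floor_z: int = 0) -> Set[Voxel]:
    """Fill each (x, y) column from floor_z up to its highest observed voxel.

    Assumption: space hidden below a visible surface is treated as occupied,
    which over-approximates the true geometry in occluded regions.
    """
    top: Dict[Tuple[int, int], int] = {}
    for i, j, k in observed:
        top[(i, j)] = max(k, top.get((i, j), floor_z))

    filled = set(observed)  # never drop anything that was actually observed
    for (i, j), k_top in top.items():
        for k in range(floor_z, k_top + 1):
            filled.add((i, j, k))
    return filled


if __name__ == "__main__":
    # Two surface points seen by a single camera: a tall object and a low one.
    cloud = [(0.5, 0.5, 2.5), (1.5, 0.5, 0.5)]
    grid = complete_downward(voxelize(cloud, voxel_size=1.0))
    print(sorted(grid))
```

Under this assumed rule the completed grid is always a superset of the observed voxels, which is the sense in which the representation is conservative: a planner that avoids the grid may give up some free space but does not treat occluded volume as free.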
