UniPlan: Vision-Language Task Planning for Mobile Manipulation with Unified PDDL Formulation

Haoming Ye, Yunxiao Xiao, Cewu Lu, Panpan Cai

TL;DR

UniPlan tackles long-horizon mobile manipulation in large-scale indoor environments by unifying vision-language grounding with symbolic planning in a single PDDL framework. It builds a visual-topological map whose nodes are anchored with scene images, and programmatically expands a tabletop domain learned by UniDomain to cover navigation, door traversal, and bimanual actions. For a given task, it grounds task-relevant objects with a VLM and compresses the map into a task-oriented PDDL problem. A PDDL planner then computes globally cost-aware plans, which are refined into executable navigation and manipulation sequences. Experiments on real scenes show higher success rate and plan quality with lower latency than the baselines, with an average success rate (SR) of approximately 83.75% and RPQG improvements across baselines. The approach makes planning for real-world indoor robotics more scalable and robust by reducing dependence on mobile-manipulation demonstrations and enabling joint reasoning over movement and manipulation.

Abstract

Integrating VLM reasoning with symbolic planning has proven to be a promising approach to real-world robot task planning. Existing work such as UniDomain effectively learns symbolic manipulation domains, described in Planning Domain Definition Language (PDDL), from real-world demonstrations and has successfully applied them to real-world tasks. These domains, however, are restricted to tabletop manipulation. We propose UniPlan, a vision-language task planning system for long-horizon mobile manipulation in large-scale indoor environments that unifies scene topology, visuals, and robot capabilities into a holistic PDDL representation. UniPlan programmatically extends learned tabletop domains from UniDomain to support navigation, door traversal, and bimanual coordination. It operates on a visual-topological map comprising navigation landmarks anchored with scene images. Given a language instruction, UniPlan retrieves task-relevant nodes from the map and uses a VLM to ground the anchored images into task-relevant objects and their PDDL states. Next, it reconnects these nodes into a compressed, densely connected topological map, also represented in PDDL, with connectivity and costs derived from the original map. Finally, a mobile-manipulation plan is generated using off-the-shelf PDDL solvers. Evaluated on human-raised tasks in a large-scale map with real-world imagery, UniPlan significantly outperforms VLM and LLM+PDDL planning in success rate, plan quality, and computational efficiency.
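
The map-compression step described in the abstract can be illustrated with a minimal sketch. It assumes the original map is a weighted undirected graph and that the compressed map connects every pair of task-relevant nodes with the shortest-path cost from the original map; the abstract does not state how the costs are actually derived. The node names, edge costs, and the PDDL names `connected` and `travel-cost` are hypothetical and chosen only for illustration.

```python
import heapq


def add_edge(graph, a, b, cost):
    """Insert an undirected weighted edge into an adjacency dict."""
    graph.setdefault(a, {})[b] = cost
    graph.setdefault(b, {})[a] = cost


def shortest_costs(graph, source):
    """Dijkstra: shortest-path cost from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def compress_map(graph, keep):
    """Return {(a, b): cost} for every ordered pair of reachable kept nodes."""
    edges = {}
    for a in keep:
        dist = shortest_costs(graph, a)
        for b in keep:
            if a != b and b in dist:
                edges[(a, b)] = dist[b]
    return edges


def to_pddl_init(edges):
    """Render compressed edges as PDDL :init facts (hypothetical names)."""
    facts = []
    for (a, b), cost in sorted(edges.items()):
        facts.append(f"(connected {a} {b})")
        facts.append(f"(= (travel-cost {a} {b}) {cost:g})")
    return facts


# Hypothetical original map: two hallway nodes link three task-relevant nodes.
full_map = {}
add_edge(full_map, "coffee_maker", "hall_a", 2.0)
add_edge(full_map, "hall_a", "hall_b", 3.0)
add_edge(full_map, "hall_b", "office_table", 1.0)
add_edge(full_map, "hall_a", "meeting_table", 4.0)

task_nodes = ["coffee_maker", "office_table", "meeting_table"]
edges = compress_map(full_map, task_nodes)
init_facts = to_pddl_init(edges)
print("\n".join(init_facts))
```

In this sketch the hallway nodes disappear from the planning problem, while their traversal costs are folded into the direct edges between the retained nodes, so a cost-aware solver can still compare routes.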

Paper Structure

This paper contains 68 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of UniPlan. See the overview section for detailed descriptions.
  • Figure 2: AST-based operator rewriting for domain expansion on pick_from_bowl. Left: the parsed AST exposing the syntactic structure of parameters, precondition, and effect. Middle: the original action schema. Right: the expanded schema after rewriting, where blue denotes arm-utility predicates, red denotes topological-state predicates, and orange denotes cost predicates.
  • Figure 3: Visual-topological map example, where yellow nodes represent pose nodes, blue nodes represent room nodes, and images denote asset nodes.
  • Figure 4: The compressed map generated through task-oriented retrieval. It focuses on the coffee maker, office 602 table (for cups), and the meeting table, while maintaining the minimal topological connectivity required for navigation.
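
The AST-based operator rewriting mentioned in the Figure 2 caption can be sketched roughly as follows. This is not the paper's implementation: the toy `pick_from_bowl` schema, the added parameters, and the predicates `arm-free`, `robot-at`, and `total-cost` are all hypothetical stand-ins for the arm-utility, topological-state, and cost predicates the caption names. The sketch only shows the general idea of parsing an action schema into a tree, appending conjuncts, and serializing it back.

```python
def tokenize(text):
    """Split PDDL text into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Parse one S-expression from a token list into nested Python lists."""
    token = tokens.pop(0)
    if token != "(":
        return token
    node = []
    while tokens[0] != ")":
        node.append(parse(tokens))
    tokens.pop(0)  # discard the closing ")"
    return node


def serialize(node):
    """Turn a nested list back into PDDL text."""
    if isinstance(node, str):
        return node
    return "(" + " ".join(serialize(child) for child in node) + ")"


def conjoin(expr, extra):
    """Append conjuncts, wrapping expr in (and ...) when it is a single atom."""
    if isinstance(expr, list) and expr and expr[0] == "and":
        return expr + extra
    return ["and", expr] + extra


def expand(action, params, preconditions, effects):
    """Return a rewritten copy of the action; the input tree is not mutated."""
    out = list(action)
    i = out.index(":parameters") + 1
    out[i] = out[i] + params
    i = out.index(":precondition") + 1
    out[i] = conjoin(out[i], preconditions)
    i = out.index(":effect") + 1
    out[i] = conjoin(out[i], effects)
    return out


# Hypothetical tabletop schema, not the one learned by UniDomain.
source = """
(:action pick_from_bowl
  :parameters (?o ?b)
  :precondition (in ?o ?b)
  :effect (and (holding ?o) (not (in ?o ?b))))
"""

schema = parse(tokenize(source))
expanded = expand(
    schema,
    params=["?arm", "?loc"],
    preconditions=[["arm-free", "?arm"], ["robot-at", "?loc"]],
    effects=[["increase", ["total-cost"], "1"]],
)
print(serialize(expanded))
```

Working on the parsed tree rather than on raw strings keeps the rewrite independent of whitespace and of whether the original precondition was a single atom or already a conjunction.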