UniPlan: Vision-Language Task Planning for Mobile Manipulation with Unified PDDL Formulation
Haoming Ye, Yunxiao Xiao, Cewu Lu, Panpan Cai
TL;DR
UniPlan tackles long-horizon mobile manipulation in large-scale indoor environments by unifying vision-language grounding and symbolic planning in a single PDDL framework. It builds a visual-topological map whose navigation landmarks are anchored with scene images, and it programmatically extends the tabletop domains learned by UniDomain to cover navigation, door traversal, and bimanual actions. For a given task, it grounds task-relevant objects with a VLM and compresses the map into a task-oriented PDDL problem. A unified PDDL planner then computes globally cost-aware plans, which are refined into executable navigation and manipulation sequences. Experiments on real scenes show significant gains in success rate and plan quality with reduced latency: UniPlan achieves an average SR of approximately 83.75% and improves RPQG over the baselines. The approach makes planning for real-world indoor robotics more scalable and robust by reducing dependence on mobile demonstrations and enabling joint reasoning over movement and manipulation.
Abstract
Integrating VLM reasoning with symbolic planning has proven to be a promising approach to real-world robot task planning. Existing work such as UniDomain effectively learns symbolic manipulation domains, described in the Planning Domain Definition Language (PDDL), from real-world demonstrations and has successfully applied them to real-world tasks. These domains, however, are restricted to tabletop manipulation. We propose UniPlan, a vision-language task planning system for long-horizon mobile manipulation in large-scale indoor environments that unifies scene topology, visuals, and robot capabilities into a holistic PDDL representation. UniPlan programmatically extends learned tabletop domains from UniDomain to support navigation, door traversal, and bimanual coordination. It operates on a visual-topological map comprising navigation landmarks anchored with scene images. Given a language instruction, UniPlan first retrieves task-relevant nodes from the map and uses a VLM to ground the anchored images into task-relevant objects and their PDDL states. Next, it reconnects these nodes into a compressed, densely connected topological map, also represented in PDDL, with connectivity and costs derived from the original map. Finally, a mobile-manipulation plan is generated using off-the-shelf PDDL solvers. Evaluated on human-posed tasks in a large-scale map with real-world imagery, UniPlan significantly outperforms VLM and LLM+PDDL planning in success rate, plan quality, and computational efficiency.
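
The map-compression step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the original map is an undirected weighted graph, that the cost between two retained nodes is their shortest-path cost in the original map, and that the PDDL names `connected` and `travel-cost` are hypothetical placeholders. The node names and edge costs are made up for illustration.

```python
import heapq

def shortest_costs(graph, source):
    """Dijkstra over a weighted graph given as {node: {neighbor: cost}}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def compress_map(graph, keep):
    """Densely connect the kept nodes; each edge cost is the shortest-path
    cost in the original map (an assumption of this sketch)."""
    edges = {}
    for a in keep:
        dist = shortest_costs(graph, a)
        for b in keep:
            if a != b and b in dist:
                edges[(a, b)] = dist[b]
    return edges

def to_pddl_init(edges):
    """Emit PDDL :init facts; predicate and function names are hypothetical."""
    lines = []
    for (a, b), cost in sorted(edges.items()):
        lines.append(f"(connected {a} {b})")
        lines.append(f"(= (travel-cost {a} {b}) {cost:g})")
    return "\n".join(lines)

# Toy map: only "pantry" and "office" are assumed to be task-relevant.
full_map = {
    "hall": {"kitchen": 4.0, "office": 6.0},
    "kitchen": {"hall": 4.0, "pantry": 2.0},
    "pantry": {"kitchen": 2.0},
    "office": {"hall": 6.0},
}
compressed = compress_map(full_map, ["pantry", "office"])
print(to_pddl_init(compressed))
```

In this toy example, the four-node map collapses to two nodes joined by a single edge in each direction. That edge carries the accumulated cost of the path through the dropped nodes, so a PDDL solver can still account for travel cost without seeing the full map.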
