Unifying Deep Predicate Invention with Pre-trained Foundation Models
Qianwei Wang, Bowen Li, Zhanpeng Luo, Yifan Xu, Alexander Gray, Tom Silver, Sebastian Scherer, Katia Sycara, Yaqi Xie
TL;DR
UniPred tackles long-horizon robotic planning by unifying top-down predicate proposals from foundation models with bottom-up, data-driven refinement, enabling robust symbolic world models in cluttered and non-STRIPS domains. The method iteratively refines LLM-generated predicate hypotheses via effect-based learning on low-level transitions and groundings, while leveraging strong visual features to ground predicates from images. A key contribution is a derived-aware predicate selection method that distinguishes basic from derived predicates to maintain planner reliability in non-STRIPS settings. Across simulated and real-world tasks, UniPred achieves 2-4x higher success rates than top-down methods and 3-4x faster learning than bottom-up approaches, demonstrating scalable, foundation-model-grounded abstraction for long-horizon robotic planning.
Abstract
Long-horizon robotic tasks are hard due to continuous state-action spaces and sparse feedback. Symbolic world models help by decomposing tasks into discrete predicates that capture object properties and relations. Existing methods learn predicates either top-down, by prompting foundation models without data grounding, or bottom-up, from demonstrations without high-level priors. We introduce UniPred, a bilevel learning framework that unifies both. UniPred uses large language models (LLMs) to propose predicate effect distributions that supervise neural predicate learning from low-level data, while learned feedback iteratively refines the LLM hypotheses. Leveraging strong visual foundation model features, UniPred learns robust predicate classifiers in cluttered scenes. We further propose a predicate evaluation method that supports symbolic models beyond STRIPS assumptions. Across five simulated domains and one real-robot domain, UniPred achieves 2-4 times higher success rates than top-down methods and 3-4 times faster learning than bottom-up approaches, advancing scalable and flexible symbolic world modeling for robotics.
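
To illustrate why moving beyond STRIPS assumptions calls for separating basic from derived predicates, the following is a minimal sketch and not the UniPred implementation. It assumes a hypothetical blocks-style domain in which `On` is a basic predicate changed directly by operator effects and `Above` is a derived predicate recomputed from `On` as its transitive closure. The predicate names and the helpers `derive_above` and `apply_effects` are illustrative assumptions, not taken from the paper.

```python
from itertools import product

def derive_above(atoms):
    """Derived predicate (assumed): Above is the transitive closure of basic On."""
    above = {(a[1], a[2]) for a in atoms if a[0] == "On"}
    changed = True
    while changed:
        changed = False
        # Chain Above(x, y) and Above(y, z) into Above(x, z) until a fixed point.
        for (x, y), (y2, z) in product(list(above), repeat=2):
            if y == y2 and (x, z) not in above:
                above.add((x, z))
                changed = True
    return {("Above", x, z) for (x, z) in above}

def apply_effects(atoms, add, delete):
    """STRIPS-style update: operator effects touch basic atoms only."""
    return (atoms - delete) | add

# Hypothetical three-block tower: a on b, b on c, c on the table.
basic = {("On", "a", "b"), ("On", "b", "c"), ("OnTable", "c")}
before = derive_above(basic)

# Picking up block a deletes one basic atom and adds one.
basic_next = apply_effects(basic, add={("Holding", "a")}, delete={("On", "a", "b")})
after = derive_above(basic_next)

# Derived atoms that became false as a side effect of a single basic deletion.
print(sorted(before - after))  # [('Above', 'a', 'b'), ('Above', 'a', 'c')]
```

In this toy setting, one deleted basic atom removes two derived atoms, and the number removed depends on the height of the tower. A fixed STRIPS add/delete list cannot express such state-dependent changes, which is one plausible reason to treat derived predicates separately from basic ones when selecting predicates for a planner.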
