Unifying Deep Predicate Invention with Pre-trained Foundation Models

Qianwei Wang, Bowen Li, Zhanpeng Luo, Yifan Xu, Alexander Gray, Tom Silver, Sebastian Scherer, Katia Sycara, Yaqi Xie

TL;DR

UniPred tackles long-horizon robotic planning by unifying top-down predicate proposals from foundation models with bottom-up, data-driven refinement, enabling robust symbolic world models in cluttered and non-STRIPS domains. The method iteratively refines LLM-generated predicate hypotheses via effect-based learning on low-level transitions, and uses strong visual features to ground predicates in images. A key contribution is derived-aware predicate selection, which distinguishes basic from derived predicates to keep the planner reliable in non-STRIPS settings. Across simulated and real-world tasks, UniPred delivers 2-4x higher training success than top-down methods and 3-4x faster learning than bottom-up approaches, demonstrating scalable, foundation-model-grounded abstraction for long-horizon robotic planning.

Abstract

Long-horizon robotic tasks are hard due to continuous state-action spaces and sparse feedback. Symbolic world models help by decomposing tasks into discrete predicates that capture object properties and relations. Existing methods learn predicates either top-down, by prompting foundation models without data grounding, or bottom-up, from demonstrations without high-level priors. We introduce UniPred, a bilevel learning framework that unifies both. UniPred uses large language models (LLMs) to propose predicate effect distributions that supervise neural predicate learning from low-level data, while learned feedback iteratively refines the LLM hypotheses. Leveraging strong visual foundation model features, UniPred learns robust predicate classifiers in cluttered scenes. We further propose a predicate evaluation method that supports symbolic models beyond STRIPS assumptions. Across five simulated domains and one real-robot domain, UniPred achieves 2-4 times higher success rates than top-down methods and 3-4 times faster learning than bottom-up approaches, advancing scalable and flexible symbolic world modeling for robotics.

Paper Structure

This paper contains 24 sections, 7 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: (a) Planning Framework: Given an initial RGB image, UniPred extracts object-centric features using a foundation model (e.g., DINOv3 [simeoni2025dinov3]) and abstracts them into invented ground atoms. These atoms are input into the learned Bilevel Planner to generate a hybrid plan toward the goal. In intermediate steps, UniPred infers the corresponding atoms from observed images to support replanning when needed. (b) Predicate Invention: UniPred invents predicates by unifying top-down knowledge from a foundation model [achiam2023gpt] with bottom-up feedback derived from data.
  • Figure 2: An example of train and test tasks in the table-cleaning domain. During training, both toys are initially on the table and the wiper is in the box. At test time, the system must plan with more toys, whose configurations are randomly initialized.
  • Figure 3: Overview of the UniPred Framework: Training and Inference. The framework consists of two main phases. Training (Left) uses an offline demonstration dataset: we first prompt a large language model (LLM) to complete a partial PDDL domain specification. An LLM-in-the-loop bilevel learning framework then constructs a comprehensive set of predicate candidates, which are refined through a final down-selection step. During Inference (Right), UniPred processes input images by extracting robust visual features from object-centric regions using a visual foundation model (VFM) [simeoni2025dinov3]; these features are converted into the initial set of ground predicates. Based on these predicates, the learned bilevel planner generates the initial execution plan and optionally replans when skill execution failures are observed.
  • Figure 4: Effect-supervised training process for the classifier of predicate $p_{\text{robot}}(\text{?robot})$. The left panel shows three consecutive states during a pick-and-place sequence, and the right panel illustrates how an incorrect effect specification leads to inconsistent labels across these states, which increases the training and validation loss.
  • Figure 5: Overview of unified bilevel learning. The top row depicts four image states from the table-cleaning task: the robot hand is empty in states 0, 1, and 3, while it holds a towel in state 2. The bottom row visualizes the evolution of DINO image features through the MLP classifier. Left: The original 2D projection, where the four states are not clearly separable. Middle (Iteration 1): After the LLM proposes symbolic knowledge and generates training labels, the hidden features begin to exhibit structure, though state 2 remains partially entangled. Right (Iteration 2): The task loss is fed back to the LLM, prompting an update to the predicate specification and labels; this yields a refined hidden representation where state 2 is cleanly separated from the empty-hand states.
  • ...and 7 more figures
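
The label inconsistency described in the Figure 4 caption can be illustrated with a small sketch. This is one plausible reading of effect-based labeling, not the paper's implementation: an "add" effect is assumed to label the state before the action as 0 and the state after it as 1, a "delete" effect the reverse, and any other action imposes no label. The function names `effect_labels` and `conflicting_states`, the action names, and the two effect specifications are all hypothetical.

```python
from collections import defaultdict

ADD, DELETE = "add", "delete"


def effect_labels(actions, effects):
    """Collect the labels each state receives from a hypothesized effect spec.

    actions[i] is the action taken between state i and state i + 1.
    effects maps an action name to ADD or DELETE; actions missing from the
    map are assumed to leave the predicate unchanged and impose no label.
    """
    labels = defaultdict(set)
    for i, action in enumerate(actions):
        effect = effects.get(action)
        if effect == ADD:
            # Predicate goes from false (before) to true (after).
            labels[i].add(0)
            labels[i + 1].add(1)
        elif effect == DELETE:
            # Predicate goes from true (before) to false (after).
            labels[i].add(1)
            labels[i + 1].add(0)
    return dict(labels)


def conflicting_states(labels):
    """Return indices of states that were assigned both 0 and 1."""
    return sorted(s for s, values in labels.items() if len(values) > 1)


# Three consecutive states linked by a pick and then a place (hypothetical).
actions = ["pick", "place"]

# A holding-like predicate: pick adds it, place deletes it.
plausible_spec = {"pick": ADD, "place": DELETE}
# An incorrect hypothesis: both actions are claimed to add the predicate.
incorrect_spec = {"pick": ADD, "place": ADD}

print(conflicting_states(effect_labels(actions, plausible_spec)))
print(conflicting_states(effect_labels(actions, incorrect_spec)))
```

Under the incorrect specification the middle state is labeled both true (as the outcome of the pick) and false (as the starting point of the place). A classifier cannot fit both labels for the same image, which is consistent with the higher training and validation loss the caption describes.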

Theorems & Definitions (2)

  • Definition 1: Operator
  • Definition 2: Sampler
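
Only the titles of these two definitions are listed here, so the following is a minimal sketch of how operators and samplers are commonly represented in bilevel planning, not the paper's own definitions. It assumes a ground STRIPS-style operator with preconditions, add effects, and delete effects, and a sampler that proposes continuous action parameters. The names `Operator`, `Sampler`, `sample_grasp_offset`, and all atoms are hypothetical, and the sampler is simplified to depend only on a random generator.

```python
import random
from dataclasses import dataclass
from typing import Callable, FrozenSet, Sequence

Atom = str  # a ground atom, written as a string for simplicity


@dataclass(frozen=True)
class Operator:
    """A ground STRIPS-style operator (assumed form, for illustration)."""

    name: str
    preconditions: FrozenSet[Atom]
    add_effects: FrozenSet[Atom]
    delete_effects: FrozenSet[Atom]

    def applicable(self, state: FrozenSet[Atom]) -> bool:
        # Every precondition atom must hold in the abstract state.
        return self.preconditions <= state

    def apply(self, state: FrozenSet[Atom]) -> FrozenSet[Atom]:
        if not self.applicable(state):
            raise ValueError(f"{self.name} is not applicable in this state")
        # Remove the delete effects, then add the add effects.
        return (state - self.delete_effects) | self.add_effects


# A sampler proposes continuous parameters for the skill that refines an
# operator. Here it is reduced to a function of a random generator.
Sampler = Callable[[random.Random], Sequence[float]]


def sample_grasp_offset(rng: random.Random) -> Sequence[float]:
    """Hypothetical sampler: a small planar grasp offset in metres."""
    return [rng.uniform(-0.01, 0.01), rng.uniform(-0.01, 0.01)]


pick_towel = Operator(
    name="Pick(robot, towel)",
    preconditions=frozenset({"HandEmpty(robot)", "OnTable(towel)"}),
    add_effects=frozenset({"Holding(robot, towel)"}),
    delete_effects=frozenset({"HandEmpty(robot)", "OnTable(towel)"}),
)

initial_state = frozenset({"HandEmpty(robot)", "OnTable(towel)"})
next_state = pick_towel.apply(initial_state)
offset = sample_grasp_offset(random.Random(0))
```

The split mirrors the two levels of a bilevel planner: operators drive the discrete search over abstract states, while samplers supply the continuous parameters needed to execute each chosen step.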