From Pixels to Predicates: Learning Symbolic World Models via Pretrained Vision-Language Models

Ashay Athalye, Nishanth Kumar, Tom Silver, Yichao Liang, Jiuguang Wang, Tomás Lozano-Pérez, Leslie Pack Kaelbling

TL;DR

This work uses pretrained vision-language models (VLMs) to propose a large set of visual predicates potentially relevant for decision-making and to evaluate those predicates directly from camera images. The learned symbolic world model is then applied to solve problems with a wide variety of object types, arrangements, numbers of objects, and visual backgrounds.

Abstract

Our aim is to learn to solve long-horizon decision-making problems in complex robotics domains given low-level skills and a handful of short-horizon demonstrations containing sequences of images. To this end, we focus on learning abstract symbolic world models that facilitate zero-shot generalization to novel goals via planning. A critical component of such models is the set of symbolic predicates that define properties of and relationships between objects. In this work, we leverage pretrained vision-language models (VLMs) to propose a large set of visual predicates potentially relevant for decision-making, and to evaluate those predicates directly from camera images. At training time, we pass the proposed predicates and demonstrations into an optimization-based model-learning algorithm to obtain an abstract symbolic world model that is defined in terms of a compact subset of the proposed predicates. At test time, given a novel goal in a novel setting, we use the VLM to construct a symbolic description of the current world state, and then use a search-based planning algorithm to find a sequence of low-level skills that achieves the goal. We demonstrate empirically across experiments in both simulation and the real world that our method can generalize aggressively, applying its learned world model to solve problems with a wide variety of object types, arrangements, numbers of objects, and visual backgrounds, as well as novel goals and much longer horizons than those seen at training time.
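
To make the test-time step concrete, here is a minimal Python sketch of planning over an abstract symbolic state. It is not the authors' implementation. It assumes STRIPS-style operators (preconditions, add effects, delete effects) and uses breadth-first search as a stand-in for the search-based planner, which the abstract does not specify. The initial state is hard-coded where the method would use VLM-labeled atoms, and every predicate and skill name except NoObjectsOnTop (which appears in Figure 1) is hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

Atom = str                # a ground atom, e.g. "NoObjectsOnTop(table)"
State = FrozenSet[Atom]   # abstract state: the set of atoms that are true


@dataclass(frozen=True)
class Operator:
    """Assumed STRIPS-style operator tied to one low-level skill."""
    name: str
    preconditions: FrozenSet[Atom]
    add_effects: FrozenSet[Atom]
    delete_effects: FrozenSet[Atom]

    def applicable(self, state: State) -> bool:
        return self.preconditions <= state

    def apply(self, state: State) -> State:
        return (state - self.delete_effects) | self.add_effects


def plan(init: State, goal: FrozenSet[Atom],
         operators: List[Operator]) -> Optional[List[str]]:
    """Breadth-first search over abstract states.

    Returns a shortest sequence of skill names, or None if the goal
    is unreachable with the given operators.
    """
    frontier = deque([(init, [])])
    visited = {init}
    while frontier:
        state, path = frontier.popleft()
        if goal <= state:
            return path
        for op in operators:
            if op.applicable(state):
                nxt = op.apply(state)
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, path + [op.name]))
    return None


def op(name, pre, add, delete):
    """Small helper to build an Operator from plain sets."""
    return Operator(name, frozenset(pre), frozenset(add), frozenset(delete))


# Hypothetical operators loosely modeled on the Cleanup task in Figure 1.
operators = [
    op("PickEraser", {"Inside(eraser, bin)", "HandEmpty()"},
       {"Holding(eraser)"}, {"Inside(eraser, bin)", "HandEmpty()"}),
    op("RemoveObstacle", {"On(cup, table)", "HandEmpty()"},
       {"NoObjectsOnTop(table)"}, {"On(cup, table)"}),
    op("WipeTable", {"Holding(eraser)", "NoObjectsOnTop(table)"},
       {"Wiped(table)"}, set()),
    op("PlaceEraserInBin", {"Holding(eraser)"},
       {"Inside(eraser, bin)", "HandEmpty()"}, {"Holding(eraser)"}),
]

# Stand-in for the atoms a VLM would label as true in the current image.
init = frozenset({"Inside(eraser, bin)", "On(cup, table)", "HandEmpty()"})
goal = frozenset({"Wiped(table)", "Inside(eraser, bin)"})

result = plan(init, goal, operators)
print(result)
# prints ['RemoveObstacle', 'PickEraser', 'WipeTable', 'PlaceEraserInBin']
```

Because the planner only sees sets of atoms, the same operators apply unchanged to a scene with different objects or backgrounds, provided the atoms can still be evaluated from the image. That is the sense in which a compact predicate set supports generalization to novel goals.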

Paper Structure

This paper contains 28 sections, 1 equation, 7 figures, 2 tables, and 2 algorithms.

Figures (7)

  • Figure 1: Overview of pix2pred in the Cleanup domain. Given $6$ human demonstrations showcasing the effects of distinct skills (e.g., wiping, dumping) and a small initial predicate set, pix2pred invents new predicates (e.g., NoObjectsOnTop(?table)) and learns symbolic operators. At test time, it uses a search-based planner that operates over the learned model to solve a novel multi-step task in a visually distinct environment, e.g., retrieving an eraser from a bin, clearing an obstacle, wiping the table, and returning the eraser.
  • Figure 2: Labeling and proposal. (Left) A small subset of the proposed predicates for the Cleanup domain. (Right) Truth values for these predicates (green is true and red is false) in a particular state, as determined by the VLM. This data is ultimately used to subselect the correct predicates.
  • Figure 3: Domains. Top row: train task example illustrations. Bottom row: test task example illustrations.
  • Figure 4: pix2pred versus baselines in simulation. Percent of evaluation tasks solved across all our simulated domains. All results are averaged over 5 seeds. Black bars denote standard deviations.
  • Figure 5: Example object annotations in Burger and Coffee.
  • ...and 2 more figures