Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning

Dillon Z. Chen; Till Hofmann; Toryn Q. Klassen; Sheila A. McIlraith

Abstract

We tackle the challenge of building embodied AI agents that can reliably solve long-horizon planning problems. Imitation learning from demonstrations has proven effective for training robots to solve diverse, complex tasks that require fine motor control and manipulation in low-level (LL), continuous environments. Yet it remains difficult to generate long-horizon plans from imitation learning alone. In contrast, high-level (HL) symbolic abstractions facilitate efficient and interpretable long-horizon planning. We propose to combine the strengths of LL imitation learning for manipulation and control with those of HL symbolic abstractions for long-horizon planning. We realise this idea via bilevel policies of the form $(\pi^{\mathrm{hl}}, \pi^{\mathrm{ll}})$, consisting of a neural policy $\pi^{\mathrm{ll}}$ learned from LL demonstrations and an HL symbolic policy $\pi^{\mathrm{hl}}$ constructed from symbolic abstractions of the LL demonstrations combined with inductive generalisation. We implement these ideas in the BISON system. Experiments on extended MetaWorld benchmarks demonstrate that BISON generalises to long horizons and to problems with greater numbers of objects than those solved by VLA and end-to-end methods, and that it is more time- and memory-efficient in training and inference. Notably, when ignoring LL execution, BISON's HL policies can solve HL problems with 10,000 relevant objects in under a minute. Project page: https://dillonzchen.github.io/bison
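
The control flow of a bilevel policy $(\pi^{\mathrm{hl}}, \pi^{\mathrm{ll}})$ can be illustrated with a minimal sketch. This is not the BISON implementation: the environment `ToyReachEnv`, the function `run_bilevel_policy`, the atom `at_target` and the action `reach_target` are hypothetical names chosen for illustration, and a hand-written controller stands in for the learned neural LL policy. The sketch only assumes what the abstract states, namely that a labelling step abstracts observations into an HL state, the HL policy proposes an HL action, and the LL policy turns that HL action into an LL action.

```python
from typing import FrozenSet, Optional, Tuple

Atom = Tuple[str, ...]        # a ground HL fact, e.g. ("at_target",)
HLState = FrozenSet[Atom]     # an HL state or goal is a set of atoms
HLAction = Tuple[str, ...]    # an HL action name followed by its arguments


class ToyReachEnv:
    """Hypothetical 1-D stand-in for an LL environment: move a gripper to a target."""

    def __init__(self, start: float = 0.0, target: float = 1.0) -> None:
        self.start, self.target = start, target
        self.x = start

    def reset(self) -> float:
        self.x = self.start
        return self.x

    def step(self, velocity: float) -> float:
        self.x += velocity
        return self.x


def run_bilevel_policy(env, label, pi_hl, pi_ll, goal: HLState, max_steps: int = 50):
    """Alternate between HL action selection and LL action execution."""
    obs = env.reset()
    for step in range(max_steps):
        s_hl = label(obs)                  # abstract the observation into an HL state
        if goal <= s_hl:                   # every goal atom holds: done
            return True, step
        a_hl = pi_hl(s_hl, goal)           # HL policy proposes a symbolic action
        if a_hl is None:                   # no applicable rule: give up
            return False, step
        a_ll = pi_ll(obs, a_hl, goal)      # LL policy conditions on the HL action
        obs = env.step(a_ll)
    return goal <= label(obs), max_steps


env = ToyReachEnv()
goal: HLState = frozenset({("at_target",)})


def label(x: float) -> HLState:
    # Assumed labelling function: the only HL fact is whether the target is reached.
    return frozenset({("at_target",)}) if abs(x - env.target) < 0.05 else frozenset()


def pi_hl(s_hl: HLState, g_hl: HLState) -> Optional[HLAction]:
    # Assumed one-rule symbolic policy: act while some goal atom is unsatisfied.
    return ("reach_target",) if g_hl - s_hl else None


def pi_ll(x: float, a_hl: HLAction, g_hl: HLState) -> float:
    # Hand-written bounded-velocity controller in place of a learned neural policy.
    return max(-0.3, min(0.3, env.target - x))


success, steps = run_bilevel_policy(env, label, pi_hl, pi_ll, goal)
print(success, steps)
```

Re-labelling the observation after every LL step is a design choice of this sketch; it lets the HL policy react if an LL action does not have its intended effect.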

Paper Structure (63 sections, 3 theorems, 13 equations, 8 figures, 3 tables, 5 algorithms)

Introduction
Related Work
Planning with Abstractions and Hierarchical Reinforcement Learning
Task and Motion Planning
Problem Statement
Bilevel Planning
Downward Refinement Property
LL Representation: Object-Centric, Ego-Centric Viewpoints
HL Representation: Relational State and Action Abstractions
Where do $\mathcal{D}$ and $\mathcal{L}$ come from?
Problem Statement Summary
Learning Bilevel Policies for Bilevel Planning
Learning HL Policies via Goal Regression and Inductive Generalisation
HL policy representation
Step 1: construct HL traces
...and 48 more sections

Key Result

Theorem 1

Let $\mathcal{D} = \langle \mathcal{P}, \mathcal{A} \rangle$ be an HL domain, $\mathcal{L}$ a labelling function, and $C \in \mathbb{N}$. There exists a finite dataset $\mathbb{T}$ such that the HL policy learned from $\mathbb{T}$ via the HL policy learning algorithm (alg:hl-policy-learning) solves any HL planning problem $\mathbf{P}

Figures (8)

Figure 1: Top Left -- inputs for learning and executing bilevel policies: a domain theory $\mathcal{D}$, a labelling function $\mathcal{L}$ that maps observations to state abstractions, and LL demos with HL goals. Bottom Left -- bilevel policy learning: LL demos induce HL demos via $\mathcal{L}$, and LL/HL policies are separately learned from LL/HL demos. Right -- bilevel policy execution: state abstractions $\mathit{s}^{\mathrm{hl}}$ are computed from observations $\mathbf{s}^{\mathrm{ll}}$ via $\mathcal{L}$ to propose HL actions $\mathit{a}^{\mathrm{hl}}$ which in turn help propose LL actions $\mathbf{a}^{\mathrm{ll}}$.
Figure 2: The HL policy $\pi^{\mathrm{hl}}(\mathit{a}^{\mathrm{hl}} \mid \mathit{s}^{\mathrm{hl}}, \mathit{g}^{\mathrm{hl}})$ learning process. Step 1: we use the labelling function $\mathcal{L}$ to construct HL traces from the LL demonstrations paired with HL goals. Step 2: we utilise goal regression to extract condition-action rules from the HL traces and goals (underlined). Step 3: we inductively generalise the rules by replacing objects with variables to produce symbolic policies.
Figure 3: The LL policy $\pi^{\mathrm{ll}}(\mathbf{a}^{\mathrm{ll}} \mid \mathbf{s}^{\mathrm{ll}}, \mathit{a}^{\mathrm{hl}}, \mathit{g}^{\mathrm{hl}})$ represented by a GNN. In this example, the input action is $\mathit{a}^{\mathrm{hl}} = \mathit{pick}(\textit{obj}, \textit{loc})$, and the resulting output is $\mathbf{a}^{\mathrm{ll}}$. Solid lines represent graph edges, and dashed lines represent how information is passed. Bold font indicates Euclidean vectors.
Figure 4: Median (line) and range (shaded) of success rate ($\uparrow$) of methods across number of objects and environments. Results for VLA baselines are omitted as they do not complete any tasks.
Figure 5: Success rate ($\uparrow$) vs. time ($\downarrow$) for training and inference with 10 objects.
...and 3 more figures
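
Steps 2 and 3 of the Figure 2 caption (goal regression followed by replacing objects with variables) can be sketched as follows. This is a hedged illustration under STRIPS-style assumptions, not the paper's algorithm: the `Action` class, the functions `regress`, `extract_rules` and `lift`, the `pick`/`place` action definitions and the object names are all hypothetical, and the rule format (state condition, goal condition, action) is a simplification chosen here.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple

Atom = Tuple[str, ...]  # predicate name followed by object names


@dataclass(frozen=True)
class Action:
    name: str
    args: Tuple[str, ...]
    pre: FrozenSet[Atom]
    add: FrozenSet[Atom]
    delete: FrozenSet[Atom]


def regress(cond: FrozenSet[Atom], a: Action) -> Optional[FrozenSet[Atom]]:
    """STRIPS regression: what must hold before `a` so that `cond` holds after it."""
    if cond & a.delete:          # the action would destroy part of the condition
        return None
    return (cond - a.add) | a.pre


def extract_rules(trace: List[Action], goal: FrozenSet[Atom]):
    """Walk the trace backwards, emitting one ground rule per action."""
    rules, cond = [], goal
    for a in reversed(trace):
        before = regress(cond, a)
        if before is None:
            raise ValueError(f"cannot regress through {a.name}{a.args}")
        rules.append((before, goal, a))
        cond = before
    return rules


def lift(state_cond: FrozenSet[Atom], goal_cond: FrozenSet[Atom], a: Action):
    """Replace each object by a variable, consistently across the whole rule."""
    var_of = {}

    def v(obj: str) -> str:
        return var_of.setdefault(obj, f"?x{len(var_of)}")

    def lift_atoms(atoms):
        return frozenset((p[0],) + tuple(v(o) for o in p[1:]) for p in sorted(atoms))

    action = (a.name,) + tuple(v(o) for o in a.args)  # action arguments named first
    return lift_atoms(state_cond), lift_atoms(goal_cond), action


# Hypothetical pick-and-place trace: move "block" from "shelf" to "table".
pick = Action("pick", ("block", "shelf"),
              pre=frozenset({("at", "block", "shelf"), ("handempty",)}),
              add=frozenset({("holding", "block")}),
              delete=frozenset({("at", "block", "shelf"), ("handempty",)}))
place = Action("place", ("block", "table"),
               pre=frozenset({("holding", "block")}),
               add=frozenset({("at", "block", "table"), ("handempty",)}),
               delete=frozenset({("holding", "block")}))

goal = frozenset({("at", "block", "table")})
lifted_rules = [lift(*rule) for rule in extract_rules([pick, place], goal)]
for state_cond, goal_cond, action in lifted_rules:
    print(sorted(state_cond), sorted(goal_cond), action)
```

Because the lifted rules mention only variables, they apply to problems with different objects and different numbers of objects than those in the demonstration, which is the intuition behind inductive generalisation in this setting.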

Theorems & Definitions (10)

Definition 1: Nondeterministic Downward Refinement Property
Example 1: Pick and Place Domain
Example 2
Theorem 1
Definition 2: Goal Independence
Definition 3: Object-Renaming Equivalence
Proposition 1
proof
Theorem 1
proof
