Broadly-Exploring, Local-Policy Trees for Long-Horizon Task Planning

Brian Ichter, Pierre Sermanet, Corey Lynch

TL;DR

BELT presents a unified approach to long-horizon planning that combines an RRT-inspired global search with a local, task-conditioned policy and a temporally extended, task-conditioned model. It learns a latent task space from play data (Play-LMP) and uses a temporal distance classifier to bias expansions, enabling efficient exploration of sequential subtasks. Experiments in a realistic MuJoCo playground show BELT planning robustly over long horizons and outperforming baselines such as CEM and single-goal policies, with higher success and feasibility when using a task-conditioned model. The work demonstrates the potential for scalable, real-world long-horizon manipulation and outlines avenues for replanning and dynamic environments.

Abstract

Long-horizon planning in realistic environments requires the ability to reason over sequential tasks in high-dimensional state spaces with complex dynamics. Classical motion planning algorithms, such as rapidly-exploring random trees, are capable of efficiently exploring large state spaces and computing long-horizon, sequential plans. However, these algorithms generally struggle with complex, stochastic, and high-dimensional state spaces, as well as with narrow passages, which naturally emerge in tasks that interact with the environment. Machine learning offers a promising solution through its ability to learn general policies that can handle complex interactions and high-dimensional observations. However, these policies are generally limited in horizon length. Our approach, Broadly-Exploring, Local-policy Trees (BELT), merges these two approaches to leverage the strengths of both through a task-conditioned, model-based tree search. BELT uses an RRT-inspired tree search to efficiently explore the state space. Locally, the exploration is guided by a task-conditioned, learned policy capable of performing general short-horizon tasks. This task space can be quite general and abstract; its only requirements are that it can be sampled and that it covers the space of useful tasks well. This search is aided by a task-conditioned model that temporally extends dynamics propagation to allow long-horizon search and sequential reasoning over tasks. BELT is demonstrated experimentally to be able to plan long-horizon, sequential trajectories with a goal-conditioned policy and to generate plans that are robust.
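
The search described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the 2-D point world, the names `sample_task`, `task_model`, and `plan`, the workspace bounds, and the use of Euclidean distance in place of a learned temporal distance are all assumptions made for illustration. The point it shows is structural: each tree edge is a whole task, and a task-conditioned model predicts the state at the end of that task in a single call.

```python
import math
import random


def sample_task(rng):
    """Hypothetical task sampler: a short 2-D displacement stands in for a latent task."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    return (0.5 * math.cos(angle), 0.5 * math.sin(angle))


def task_model(state, task):
    """Hypothetical task-conditioned model: predicts the state after the whole task."""
    return (state[0] + task[0], state[1] + task[1])


def plan(start, goal, rng, max_iters=2000, tol=0.5):
    """RRT-style search whose edges are tasks rather than single actions."""
    nodes = [start]      # tree states
    parents = [None]     # index of each node's parent
    edge_tasks = [None]  # task executed on the edge into each node
    for _ in range(max_iters):
        # Sample a state in the assumed 10 x 10 workspace.
        x_sample = (rng.uniform(0.0, 10.0), rng.uniform(0.0, 10.0))
        # Pick the tree node to expand (Euclidean stand-in for temporal distance).
        i_expand = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], x_sample))
        # Sample a task and roll the model forward over the whole task at once.
        task = sample_task(rng)
        x_new = task_model(nodes[i_expand], task)
        nodes.append(x_new)
        parents.append(i_expand)
        edge_tasks.append(task)
        if math.dist(x_new, goal) <= tol:
            # Walk back to the root to recover the task sequence.
            idx, sequence = len(nodes) - 1, []
            while parents[idx] is not None:
                sequence.append(edge_tasks[idx])
                idx = parents[idx]
            return list(reversed(sequence))
    return None  # no plan found within the iteration budget


if __name__ == "__main__":
    result = plan((1.0, 1.0), (9.0, 9.0), random.Random(0))
    print("no plan found" if result is None else f"plan with {len(result)} tasks")
```

In the approach the abstract describes, the returned task sequence would then be handed to the learned policy, one task at a time; here the sequence is simply returned.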

Paper Structure

This paper contains 17 sections, 11 figures, 2 tables, 1 algorithm.

Figures (11)

  • Figure 1: BELT plans long-horizon, sequential trajectories via a task-conditioned tree and task-model.
  • Figure 2: BELT with Play Latent Motor Plans (Lynch et al., 2020) learns a general, goal-conditioned policy as well as a model from teleoperation data. Given a long-horizon task, BELT translates this into an RRT-inspired, model-based tree search through the space, where each edge represents a single task sampled from demonstration data. Trajectories are verified as successful via a trajectory-wise success check on the model and are then executed by the policy.
  • Figure 3: Broadly-Exploring, Local-policy Trees with a Play-LMP policy (Section \ref{sec:play}).
  • Figure 4: (\ref{fig:bias}) shows the bias used by BELT to choose the tree state $x_\text{expand}$ once $x_\text{sample}$ has been sampled. The blue region shows how the temporal distance between states may differ from the L2 distance, necessitating learning a temporal distance classifier. The choice between states 1 and 3 demonstrates the bias towards lower cost paths: though state 1 is closer to $x_\text{sample}$, state 3 has a much lower cost to come, and thus state 3 is selected. (\ref{fig:models}-\ref{fig:models_data}) show the two types of models used in this work, an action-conditioned and a task-conditioned model. The action-conditioned model often exhibits compounding errors as it is recursively applied along the trajectory, while the task-conditioned model avoids this by temporally extending the prediction and conditioning on the fixed task for the edge. This compounding error can be seen in the second task (block lifting), where the action-conditioned model becomes unstable and the end effector flails (https://youtu.be/zCJpNPn0BZQ?t=162).
  • Figure 5: Plans from BELT, demonstrating its ability to plan long-horizon, sequential trajectories (https://youtu.be/zCJpNPn0BZQ?t=193).
  • ...and 6 more figures
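
The expansion bias described in the Figure 4 caption can be sketched as follows. The exact selection rule is not given in this listing, so the rule below is an assumption: among tree states that a classifier deems temporally close to $x_\text{sample}$, pick the one with the lowest cost to come, and fall back to the nearest state otherwise. The function names, the `radius` parameter, and the Euclidean stand-in for the learned temporal distance classifier are all hypothetical.

```python
import math


def within_horizon(x_node, x_sample, radius=2.0):
    """Hypothetical stand-in for the learned temporal distance classifier."""
    return math.dist(x_node, x_sample) <= radius


def choose_expand(nodes, cost_to_come, x_sample):
    """Return the index of the tree state to expand towards x_sample."""
    # Candidates are the tree states judged temporally close to the sample.
    candidates = [i for i, x in enumerate(nodes) if within_horizon(x, x_sample)]
    if not candidates:
        # Assumed fallback: plain nearest-neighbour choice.
        return min(range(len(nodes)), key=lambda i: math.dist(nodes[i], x_sample))
    # Bias towards lower-cost paths: the cheapest candidate wins, not the closest.
    return min(candidates, key=lambda i: cost_to_come[i])


# Toy tree mirroring the caption: state 1 is closest to the sample,
# but state 3 is also within the horizon and has a much lower cost to come.
nodes = [(0.0, 0.0), (4.0, 4.0), (9.0, 9.0), (3.5, 5.5)]
cost_to_come = [0.0, 9.0, 12.0, 2.0]
x_sample = (4.5, 4.5)
print(choose_expand(nodes, cost_to_come, x_sample))  # prints 3
```

With these toy values, states 1 and 3 both pass the closeness check, and state 3 is chosen because its cost to come (2.0) is lower than that of state 1 (9.0).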