Subgoal-Guided Policy Heuristic Search with Learned Subgoals

Jake Tuero, Michael Buro, Levi H. S. Lelis

TL;DR

The paper tackles sample-inefficient training in policy-guided tree search for deterministic single-agent problems by learning subgoal-based policies from both failed and successful searches. It introduces a hierarchical approach in which a VQVAE-based subgoal generator and subgoal-conditioned low-level policies are combined with a high-level subgoal policy to form the final policy π^SG that guides the search. The method preserves completeness guarantees and is more sample-efficient than prior methods across several domains, solving harder instances where those methods struggle. It shows that online training on data from incomplete (failed) search trees can outperform traditional Bootstrap training and reduce environment interactions without sacrificing solution quality.
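A minimal sketch of how a high-level subgoal policy and subgoal-conditioned low-level policies might be combined into a single action distribution. It assumes the common mixture form π^SG(a|s) = Σ_g π^hi(g|s) · π^lo(a|s, g); that the paper uses exactly this form is an assumption here, and the names `mixture_policy`, `pi_hi`, and `pi_lo` are hypothetical.

```python
def mixture_policy(pi_hi, pi_lo):
    """Combine subgoal weights with per-subgoal action distributions.

    pi_hi: list of K probabilities, one per subgoal (assumed to sum to 1).
    pi_lo: K rows, each a list of A action probabilities for that subgoal.
    Returns a list of A probabilities: sum_g pi_hi[g] * pi_lo[g][a].
    """
    if len(pi_hi) != len(pi_lo):
        raise ValueError("need one low-level distribution per subgoal")
    num_actions = len(pi_lo[0])
    return [
        sum(weight * row[a] for weight, row in zip(pi_hi, pi_lo))
        for a in range(num_actions)
    ]


# Toy example: 2 subgoals, 3 actions (values are illustrative only).
pi_hi = [0.75, 0.25]
pi_lo = [
    [0.5, 0.5, 0.0],  # action distribution when pursuing subgoal 0
    [0.0, 0.0, 1.0],  # action distribution when pursuing subgoal 1
]
pi_sg = mixture_policy(pi_hi, pi_lo)
print(pi_sg)
```

Because each row of `pi_lo` and the weights `pi_hi` sum to one, the result is again a valid distribution over actions, so it can be used wherever a search algorithm expects a single policy.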

Abstract

Policy tree search is a family of tree search algorithms that use a policy to guide the search. These algorithms provide guarantees on the number of expansions required to solve a given problem that are based on the quality of the policy. While these algorithms have shown promising results, the process in which they are trained requires complete solution trajectories to train the policy. Search trajectories are obtained during a trial-and-error search process. When the training problem instances are hard, learning can be prohibitively costly, especially when starting from a randomly initialized policy. As a result, search samples are wasted in failed attempts to solve these hard instances. This paper introduces a novel method for learning subgoal-based policies for policy tree search algorithms. The subgoals and policies conditioned on subgoals are learned from the trees that the search expands while attempting to solve problems, including the search trees of failed attempts. We empirically show that our policy formulation and training method improve the sample efficiency of learning a policy and heuristic function in this online setting.
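To make "a policy guides the search" concrete, the sketch below follows Levin tree search (LevinTS), one well-known member of this family, which expands nodes in increasing order of (depth + 1) / π(path), where π(path) is the product of the action probabilities along the path from the root. It is an illustration under stated assumptions, not the algorithm evaluated in the paper; the function name, the callback signatures, and the toy problem are all hypothetical.

```python
import heapq
import math


def levin_tree_search(root, successors, is_goal, policy, budget=10_000):
    """Best-first search ordered by (depth + 1) / path probability.

    successors(state) -> list of (action, next_state)
    policy(state)     -> dict mapping action -> probability
    Returns (actions, expansions), or (None, expansions) if the budget runs out.
    """
    tie = 0  # tie-breaker so the heap never has to compare states
    # Entries: (cost, tie, state, depth, log path probability, actions so far)
    frontier = [(1.0, tie, root, 0, 0.0, [])]
    expansions = 0
    while frontier and expansions < budget:
        _, _, state, depth, log_p, actions = heapq.heappop(frontier)
        if is_goal(state):
            return actions, expansions
        expansions += 1
        probs = policy(state)
        for action, nxt in successors(state):
            p = probs.get(action, 0.0)
            if p <= 0.0:
                continue  # simplification: zero-probability actions are pruned
            child_log_p = log_p + math.log(p)
            cost = (depth + 2) / math.exp(child_log_p)  # child depth + 1, over its path probability
            tie += 1
            heapq.heappush(
                frontier,
                (cost, tie, nxt, depth + 1, child_log_p, actions + [action]),
            )
    return None, expansions


# Toy problem: states are tuples of past actions; the goal is ('R', 'R').
def toy_successors(state):
    if len(state) >= 3:
        return []
    return [("L", state + ("L",)), ("R", state + ("R",))]


def toy_policy(state):
    return {"L": 0.1, "R": 0.9}


plan, num_expansions = levin_tree_search(
    (), toy_successors, lambda s: s == ("R", "R"), toy_policy
)
print(plan, num_expansions)
```

The better the policy concentrates probability on a solution path, the smaller that path's cost and the sooner it is reached, which is the intuition behind expansion guarantees that depend on policy quality. Pruning zero-probability actions is a shortcut taken for this sketch; a policy that assigns non-zero probability to every action avoids that loss of completeness.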

Paper Structure

This paper contains 29 sections, 7 equations, 6 figures, 7 tables, 6 algorithms.

Figures (6)

  • Figure 1: (a) Tree Search. Policy tree search generates subgoals to use with $\pi^\text{SG}$. (b) Training (Non-Solution). (i) The underlying graph that induced the search tree is built from the parent-child relationships found during search and is used to create a hierarchy of cluster graphs with the Louvain algorithm. (ii) States $s_\text{cur}$ and $s_\text{tar}$ are sampled from neighbouring states in a graph $G_i$ of the hierarchy created by the Louvain algorithm. (iv) The resulting trajectory from $s_\text{cur}$ to $s_\text{tar}$ is used to update the subgoal generator and low-level policy. (c) Training (Solution). (i) The heuristic is updated using the solution trajectory. (ii) The solution trajectory is segmented. (iii) The subgoal generator is updated using the partial trajectory. (iv) The high-level and low-level policies are updated using the segmented trajectories.
  • Figure 2: The line is the average number of outstanding problems for the respective number of expansions accumulated during training. Shaded regions show the maximum and minimum number of outstanding problems across all seeds.
  • Figure 3: Reduction in search loss during training when using data from non-solution trees. The line represents the average, and the shaded regions show the maximum and minimum across all seeds.
  • Figure 4: The environment domains used in the experiments. Box-World: The agent (black) needs to open colored locks (right pixel) to receive a colored key (left pixel) until it gets to the goal (white). CraftWorld: The agent must create an iron pickaxe to get the gem so that it can craft a ring. BoulderDash: The agent must get the green key to unlock the room containing the diamond. Once the diamond is collected, the exit in the bottom-right room will open. Sokoban: The agent must push boxes to the goal locations without getting boxes stuck in corners. TSP: The agent (black) must visit each city (red) and then return to the first city (blue). The gray boxes are non-traversable obstacles.
  • Figure 5: Learning curves during training using the Bootstrap process for varying codebook sizes. The line is the average number of outstanding problems for the respective number of expansions accumulated during training. All training runs lie within the shaded regions.
  • ...and 1 more figure
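The non-solution training steps in the Figure 1 caption (build a graph from parent-child relationships, cluster it, pick states from neighbouring clusters, extract the trajectory between them) can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the cluster assignment is hard-coded where the paper would obtain it from the Louvain algorithm, only one level of the hierarchy is shown, and all function names are hypothetical.

```python
from collections import deque


def build_graph(parent_child_pairs):
    """Directed adjacency (parent -> children) recorded during search."""
    adj = {}
    for parent, child in parent_child_pairs:
        adj.setdefault(parent, set()).add(child)
        adj.setdefault(child, set())
    return adj


def cluster_graph(adj, cluster_of):
    """Coarsen the graph: two clusters are neighbours if any edge crosses them."""
    coarse = {}
    for u, children in adj.items():
        for v in children:
            cu, cv = cluster_of[u], cluster_of[v]
            if cu != cv:
                coarse.setdefault(cu, set()).add(cv)
                coarse.setdefault(cv, set()).add(cu)
    return coarse


def trajectory(adj, s_cur, s_tar):
    """Breadth-first shortest path along search edges; None if unreachable."""
    parent = {s_cur: None}
    queue = deque([s_cur])
    while queue:
        u = queue.popleft()
        if u == s_tar:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in sorted(adj[u]):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


# Toy search tree with integer state ids (illustrative only).
edges = [(0, 1), (0, 2), (1, 3), (3, 4), (2, 5)]
# Assumed cluster labels; in the paper these would come from Louvain.
cluster_of = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "C"}

adj = build_graph(edges)
coarse = cluster_graph(adj, cluster_of)
path = trajectory(adj, 0, 4)  # s_cur in cluster A, s_tar in neighbouring cluster B
print(coarse, path)
```

Here the trajectory follows edges in the direction they were generated, so it corresponds to a sequence of executable actions; treating cluster adjacency as symmetric is a design choice made for this sketch.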