Solving Sokoban using Hierarchical Reinforcement Learning with Landmarks

Sergey Pastukhov

TL;DR

A novel hierarchical reinforcement learning framework that performs top-down recursive planning via learned subgoals is introduced and successfully applied to the complex combinatorial puzzle game Sokoban.

Abstract

We introduce a novel hierarchical reinforcement learning (HRL) framework that performs top-down recursive planning via learned subgoals, successfully applied to the complex combinatorial puzzle game Sokoban. Our approach constructs a six-level policy hierarchy, where each higher-level policy generates subgoals for the level below. All subgoals and policies are learned end-to-end from scratch, without any domain knowledge. Our results show that the agent can generate long action sequences from a single high-level call. While prior work has explored 2-3 level hierarchies and subgoal-based planning heuristics, we demonstrate that deep recursive goal decomposition can emerge purely from learning, and that such hierarchies can scale effectively to hard puzzle domains.
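
The top-down recursion described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the environment is a hypothetical 1-D number line rather than Sokoban, and the learned subgoal policies are replaced by a hand-written midpoint rule (`propose_landmark`, a name assumed here for illustration). It only shows how a single call at the top of a six-level hierarchy can unfold into a long sequence of primitive actions.

```python
from typing import List

NUM_LEVELS = 6  # six policies, as in the abstract; levels are numbered 0..5 here


def propose_landmark(state: int, goal: int) -> int:
    """Hypothetical stand-in for a learned subgoal policy: pick the midpoint."""
    return (state + goal) // 2


def primitive_action(state: int, goal: int) -> int:
    """Level-0 policy: a single unit step (+1 or -1) toward the goal."""
    return (goal > state) - (goal < state)


def plan(level: int, state: int, goal: int) -> List[int]:
    """Recursively expand (state -> goal) into primitive actions.

    Each level above 0 proposes one landmark and delegates both halves,
    (state -> landmark) and (landmark -> goal), to the level below.
    Assumes the lower level actually reaches the landmark, which holds in
    this toy setting only when |goal - state| <= 2 ** level.
    """
    if state == goal:
        return []
    if level == 0:
        return [primitive_action(state, goal)]
    landmark = propose_landmark(state, goal)
    first_half = plan(level - 1, state, landmark)
    second_half = plan(level - 1, landmark, goal)
    return first_half + second_half


# One call at the top level expands into 2 ** 5 = 32 primitive steps.
actions = plan(NUM_LEVELS - 1, 0, 32)
print(len(actions))
```

In this sketch the plan length doubles with each added level, which is one way to read the abstract's claim about long action sequences from a single high-level call. In the paper the landmarks are learned rather than computed by a fixed rule.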

Paper Structure

This paper contains 17 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Sokoban example puzzle
  • Figure 2: Training the HalfWeg algorithm on Boxoban using models with a total of 689,496 free parameters and a hierarchy of 6 policies. The Y-axis shows the percentage of solved Boxoban validation test instances (unseen during training) with 36 targets
  • Figure 3: Impact of model sizes ($MA$ and $MS$) and the effect of feature removal. Both plots show training curves for Sokoban 6x6 with 3 boxes.
  • Figure 4: Example of envisioned landmark sequences for a generated Sokoban puzzle with 8 boxes
  • Figure 5: Tree of landmarks envisioned by the Boxoban model (policy $PL_5$) to solve a Boxoban level.