Proposing Hierarchical Goal-Conditioned Policy Planning in Multi-Goal Reinforcement Learning

Gavin B. Rens

TL;DR

HGCPP addresses the challenge of learning multiple long-horizon goals under sparse rewards by integrating goal-conditioned policies with hierarchical RL and Monte Carlo Tree Search planning. The framework maintains a single, evolving plan-tree of high-level actions (HLAs) built from short GCPs, enabling reuse of skills and faster reasoning through planning with HLAs rather than primitive actions. Key innovations include the CGCP formalism, a novel expansion and goal-sampling strategy, and a propagation scheme that updates GCP values and non-GCP HLAs along the plan-tree. The approach aims to improve sample efficiency, exploration, and planning in complex, multi-goal domains, with modular components that can leverage standard RL algorithms and neural approximators. If validated, HGCPP could offer a flexible blueprint for scalable, planning-informed multi-goal robotics and AI systems.

Abstract

Humanoid robots must master numerous tasks with sparse rewards, posing a challenge for reinforcement learning (RL). We propose a method combining RL and automated planning to address this. Our approach uses short goal-conditioned policies (GCPs) organized hierarchically, with Monte Carlo Tree Search (MCTS) planning using high-level actions (HLAs). Instead of primitive actions, the planning process generates HLAs. A single plan-tree, maintained during the agent's lifetime, holds knowledge about goal achievement. This hierarchy enhances sample efficiency and speeds up reasoning by reusing HLAs and anticipating future actions. Our Hierarchical Goal-Conditioned Policy Planning (HGCPP) framework uniquely integrates GCPs, MCTS, and hierarchical RL, potentially improving exploration and planning in complex tasks.

Paper Structure

This paper contains 20 sections, 11 equations, 4 figures, 1 table, and 1 algorithm.

Figures (4)

  • Figure 1: Complete plan-tree corresponding to the maze grid-world. Note that HLA7 is composed of HLA1, HLA3 and HLA6, in that order.
  • Figure 2: Maze grid-world with three main desired goals, $G_1$, $G_2$ and $G_3$, and their waypoints as desired sub-goals. Blue dots mark the endpoints of GCPs and also serve as behavioral goals. Arrows show typical trajectories of GCPs.
  • Figure 3: Four successive plan-trees. Top left: plan-tree after generating six GCPs. Middle left: plan-tree after generating twelve GCPs. Top right: plan-tree after generating eighteen GCPs. The two larger diagrams show which linked GCPs form higher-level HLAs. Bottom: Complete plan-tree corresponding to the maze grid-world. Note that HLA7 is composed of HLA1, HLA3 and HLA6, in that order.
  • Figure 4: Execution process for a robot to achieve goal $g$ starting in state $s$.