Landmark Guided Active Exploration with State-specific Balance Coefficient

Fei Cui; Jiaojiao Fang; Mengke Yang; Guizhong Liu

TL;DR

This work tackles exploration in goal-conditioned hierarchical reinforcement learning by introducing landmark-guided exploration. It defines a prospect measure via landmark-based planning in the goal space and combines it with a novelty signal; a state-specific balance coefficient $\alpha$ weights the two, trading off exploration against guidance toward the final goal. The resulting LESC framework achieves higher sample efficiency and performance than the baselines on challenging MuJoCo tasks, with ablations confirming the complementary roles of prospect, novelty, and dynamic balancing. This approach offers a principled way to leverage task-directed structure for more effective exploration in long-horizon RL problems.

Abstract

Goal-conditioned hierarchical reinforcement learning (GCHRL) decomposes long-horizon tasks into sub-tasks through a hierarchical framework and has demonstrated promising results across a variety of domains. However, the high-level policy's action space is often excessively large, which makes effective exploration difficult and can lead to inefficient training. In this paper, we design a measure of prospect for subgoals by planning in the goal space based on the goal-conditioned value function. Building on this measure, we propose a landmark-guided exploration strategy that integrates the measures of prospect and novelty, guiding the agent to explore efficiently and improving sample efficiency. To account dynamically for the impact of prospect and novelty on exploration, we introduce a state-specific balance coefficient that weights their relative significance. The experimental results demonstrate that our proposed exploration strategy significantly outperforms the baseline methods across multiple tasks.
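The idea of balancing prospect and novelty with a state-specific coefficient can be illustrated with a minimal sketch. The abstract does not give the combination rule, so the convex blend below is an assumption made purely for illustration, and the names `combined_score`, `select_subgoal`, `prospect_fn`, and `novelty_fn` are hypothetical rather than taken from the paper.

```python
def combined_score(prospect, novelty, alpha):
    """Blend two subgoal measures with a weight alpha in [0, 1].

    ASSUMPTION: a convex combination; the paper's actual rule may differ.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * prospect + (1.0 - alpha) * novelty


def select_subgoal(candidates, prospect_fn, novelty_fn, alpha):
    """Pick the candidate subgoal with the highest blended score.

    prospect_fn and novelty_fn are hypothetical callables mapping a
    candidate subgoal to a scalar measure; alpha is the balance
    coefficient computed for the current state.
    """
    return max(
        candidates,
        key=lambda g: combined_score(prospect_fn(g), novelty_fn(g), alpha),
    )
```

Under this assumed rule, a coefficient near 1 makes the choice follow prospect (guidance toward the final goal), while a coefficient near 0 makes it follow novelty (exploration of rarely visited regions).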

Paper Structure (15 sections, 9 equations, 6 figures, 2 algorithms)

Introduction
Related Work
Hierarchical Reinforcement Learning
Subgoal Selection
Preliminaries
Method
Measures for Subgoals
State-specific Balance Coefficient
Hierarchical Exploration Strategy
Experiments
Experimental Setup
Comparative Analysis
Qualitative Analysis on Measures for Subgoals
Ablation Studies
Conclusion

Figures (6)

Figure 1: The framework of GCHRL. Here $\phi$ is the subgoal representation function that maps the state to the goal space. The hierarchical framework consists of a high-level policy and a low-level policy. The reward of the high-level policy is the sum of $c$ external rewards, where $c$ is the low-level policy length, while the reward of the low-level policy is the negative distance between the state and the subgoal in the goal space.
Figure 2: Landmark selection and prospect calculation process. The calculation of prospect involves four stages: 1) Sampling: An adequate number of sample points are randomly selected from the state space. Then, the FPS algorithm is employed to sample $n_{cov}$ landmark points. 2) Building a graph: The sampled landmarks, current position, and goal are used as nodes to build a graph. The edges of the graph represent the reachability between two nodes. 3) Path planning: Using a shortest-path planning algorithm, a feasible path from the current position to the goal is determined based on the constructed graph. 4) Calculation: A landmark is sampled along the trajectory and selected as $l_{sel}$. The prospect of the subgoals (within the neighborhood of the current position) is then calculated based on the selected landmark.
Figure 3: Learning curves of LESC and baselines. (a) Point Maze. (b) Ant Push. (c) Ant Maze. (d) Ant Maze (W-shape). (e) Ant FourRooms. (f) Ant Maze (Images). The x-axis represents the training time steps, while the y-axis represents the average success rate over 50 episodes. The experiments are evaluated for each algorithm using five different random seeds. The shaded area represents the $95\%$ confidence interval.
Figure 4: Visualization of subgoal measures in the AntMaze task at time step 150000 (top) and time step 300000 (bottom). The circular markers represent the candidate subgoal set sampled by the agent. The color intensity of the markers, ranging from red to blue, indicates the corresponding measure values.
Figure 5: Ablation studies on the components of LESC. The experiments are evaluated using five different random seeds.
...and 1 more figure
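The first three stages described in the Figure 2 caption (farthest point sampling, graph construction, shortest-path planning) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: Euclidean distance with a hypothetical `max_edge` threshold stands in for reachability, which the paper derives from the goal-conditioned value function, and all function names are illustrative.

```python
import heapq
import math


def farthest_point_sampling(points, n_cov, start=0):
    """Greedily pick up to n_cov points, each farthest from those already chosen."""
    chosen = [start]
    # Distance from every point to its nearest chosen point so far.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < min(n_cov, len(points)):
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt]))
                    for d, p in zip(min_dist, points)]
    return [points[i] for i in chosen]


def build_graph(nodes, max_edge):
    """Connect node pairs closer than max_edge.

    ASSUMPTION: Euclidean distance is a stand-in for reachability.
    """
    graph = {i: [] for i in range(len(nodes))}
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            d = math.dist(nodes[i], nodes[j])
            if d <= max_edge:
                graph[i].append((j, d))
                graph[j].append((i, d))
    return graph


def shortest_path(graph, src, dst):
    """Dijkstra's algorithm; returns node indices from src to dst, or [] if unreachable."""
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return []
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]
```

With the current position, the sampled landmarks, and the goal as nodes, one simple way to obtain a guiding landmark in this sketch is to take the first node after the current position on the returned path; how the paper samples $l_{sel}$ along the trajectory is not specified in the caption.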
