Goal Exploration via Adaptive Skill Distribution for Goal-Conditioned Reinforcement Learning

Lisheng Wu, Ke Chen

TL;DR

This work tackles exploration in sparse-reward, long-horizon goal-conditioned RL (GCRL) by introducing GEASD, a framework that adaptively distributes a predefined skill set to exploit environmental structure. GEASD builds a structural representation from skill value functions and constructs a Boltzmann-style skill distribution with a dynamic temperature that modulates exploration based on local entropy within a contextual horizon, guiding deep exploration via Skill-based Local Entropy-Maximization Pattern (SLEMP). Intrinsic rewards quantify local entropy changes, enabling the learning of skill-value functions that estimate entropy gains, while a two-stage Goal Exploration Strategy leverages both sub-goal novelty and adaptive skill-driven exploration. Theoretical analysis supports the Boltzmann form for the optimal skill distribution under reasonable assumptions, and experiments on PointMaze-Spiral and AntMaze demonstrate faster and more robust exploration and transfer to unseen mazes compared with OMEGA and GEAPS, with ablations highlighting the benefits of dynamic temperature and action-history in context. Overall, GEASD advances deep exploration in GCRL by aligning exploration objectives with environmental structure through learned skill distributions, offering improved efficiency and generalization in sparse, long-horizon tasks.

Abstract

Exploration efficiency poses a significant challenge in goal-conditioned reinforcement learning (GCRL) tasks, particularly those with long horizons and sparse rewards. A primary limitation to exploration efficiency is the agent's inability to leverage environmental structural patterns. In this study, we introduce a novel framework, GEASD, designed to capture these patterns through an adaptive skill distribution during the learning process. This distribution optimizes the local entropy of achieved goals within a contextual horizon, enhancing goal-spreading behaviors and facilitating deep exploration in states containing familiar structural patterns. Our experiments reveal marked improvements in exploration efficiency using the adaptive skill distribution compared to a uniform skill distribution. Additionally, the learned skill distribution demonstrates robust generalization capabilities, achieving substantial exploration progress in unseen tasks containing similar local structures.
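To make "local entropy of achieved goals within a contextual horizon" concrete, here is a minimal sketch of one way such a quantity could be estimated, along with the entropy change that an intrinsic reward might measure. The grid discretization, the `bin_size` parameter, and the function names `local_entropy` and `entropy_gain` are assumptions for illustration only; the abstract does not specify the estimator the authors use.

```python
import numpy as np

def local_entropy(goals, bin_size=0.5):
    """Shannon entropy (in nats) of achieved goals after grid discretization.

    Assumption: goals are low-dimensional positions and a histogram over
    grid cells is an acceptable density estimate.
    """
    goals = np.asarray(goals, dtype=float)
    cells = np.floor(goals / bin_size).astype(int)       # map each goal to a grid cell
    _, counts = np.unique(cells, axis=0, return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * np.log(probs)).sum())

def entropy_gain(context_goals, new_goals, bin_size=0.5):
    """Change in local entropy after adding newly achieved goals to the context."""
    before = local_entropy(context_goals, bin_size)
    after = local_entropy(np.concatenate([context_goals, new_goals]), bin_size)
    return after - before

# Hypothetical 2-D achieved goals inside one contextual window.
context = np.array([[0.1, 0.1], [0.2, 0.1], [0.3, 0.2]])   # all in the same cell
spreading = np.array([[1.2, 0.1], [2.3, 0.1]])             # reaches two new cells
revisiting = np.array([[0.2, 0.2], [0.1, 0.3]])            # stays in the old cell

gain_spread = entropy_gain(context, spreading)
gain_revisit = entropy_gain(context, revisiting)
```

In this toy setup, a skill that spreads achieved goals into new cells yields a positive entropy gain, while one that revisits the same cell yields none, which is the kind of signal a goal-spreading objective would reward.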

Paper Structure

This paper contains 48 sections, 4 theorems, 35 equations, 9 figures, 4 tables, 1 algorithm.

Key Result

Proposition 1

Consider a set of skills denoted by $\mathcal{Z}$, where each skill $\bm{z}_i \in \mathcal{Z}$ uniquely covers a portion of the achieved goals, thereby ensuring $H(\mathcal{Z}|s_t, \bm{\Phi}(h^k_{t+k})) = 0$. The optimal skill distribution, conditioned on the historical trajectory $h^C_t$ and aiming where the energy function $E^*(\bm{z})$ associated with skill $\bm{z}$ is quantified by the resulting en

Figures (9)

  • Figure 1: Visualizing decision-making in a maze grid: Dashed arrows indicate potential future actions, while solid arrows trace the agent's historical trajectory. In (a), the agent considers all potential actions equally within its action space, not yet incorporating information from the environmental structure. In (b), it strategically exploits the local environmental structure to selectively focus on actions that would broaden the spread of achieved goals. In (c), the agent's previously established trajectory makes actions leading downwards appear less favorable. In (d), the established leftward trajectory implies that actions moving to the right may be disadvantageous.
  • Figure 2: Illustration of Our Learning Pipeline: The leftmost figure visualizes structural information, including historical paths and local layouts, as indicated by $h^C_t$. It also showcases potential trajectories for the four skills characterized by $\{\bm{z}_i\}^4_{i=1}$. Following the leftmost figure, we present the structural representation $Q(h^C_t, \mathcal{Z})$, derived from the aggregated value functions of skills. Subsequently, we derive a Boltzmann distribution $p(\bm{z}|h^C_t)$ based on these representations. Consequently, skills characterized by lower value functions are assigned minimal probabilities, leading to their marginalization, as depicted in the rightmost figure.
  • Figure 3: The experimental environments.
  • Figure 4: The test success on the desired goal distribution and the empirical entropy of achieved goals, throughout training on the two environments, for both the baselines and our models.
  • Figure 5: Visualization of the final goals achieved across historical episodes in (a) PointMaze-Spiral and (b) AntMaze-U, with the training evolution process depicted through heatmaps.
  • ...and 4 more figures
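
The step in Figure 2 that turns skill values into a Boltzmann distribution can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name `boltzmann_skill_distribution`, the example values in `q`, and the fixed temperatures are assumptions. The framework described here uses a dynamic temperature driven by local entropy, whose exact schedule is not given on this page, so the temperature is passed in as a plain parameter.

```python
import numpy as np

def boltzmann_skill_distribution(q_values, temperature):
    """Return p(z) proportional to exp(Q(z) / temperature) over a discrete skill set."""
    q = np.asarray(q_values, dtype=float)
    logits = (q - q.max()) / temperature   # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical value estimates for four skills z_1..z_4 in the current context.
q = np.array([0.9, 0.1, 0.8, -0.5])

p_sharp = boltzmann_skill_distribution(q, temperature=0.1)   # low T: near-greedy
p_flat = boltzmann_skill_distribution(q, temperature=10.0)   # high T: near-uniform

# Sample a skill index from the sharper distribution.
rng = np.random.default_rng(0)
skill = rng.choice(len(q), p=p_sharp)
```

With a low temperature, skills with low value estimates receive almost no probability mass, matching the marginalization shown in the rightmost panel of Figure 2; with a high temperature the distribution approaches uniform.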

Theorems & Definitions (8)

  • Definition 1
  • Definition 2
  • Proposition 1
  • Proposition 2
  • Proposition 1
  • proof
  • Proposition 2
  • proof