A tale of two goals: leveraging sequentiality in multi-goal scenarios
Olivier Serris, Stéphane Doncieux, Olivier Sigaud
TL;DR
The paper tackles the problem of following sequences of intermediate goals in hierarchical RL when an intermediate goal can be reached in configurations that hinder progress toward the final goal. It introduces two MDP formulations, $M_{gseq}$ and $M_{2G}$, in which the low-level goal-conditioned policy is conditioned on the current and final goals, or on the next two goals in the sequence, respectively, and it trains these policies with TD3+HER on goal sequences supplied by an expert planner. Empirical results on navigation and pole-balancing tasks show that $M_{2G}$ is, in most cases, more stable and more sample-efficient than both the standard single-goal formulation and $M_{gseq}$; $M_{gseq}$ remains a viable but slower alternative due to the broader diversity of goals it must handle. The findings highlight the value of conditioning on more than one upcoming goal for reliable goal chaining. The reliance on a fixed planner is noted as a limitation, and jointly learning or adapting the planner with the learner is suggested as future work.
Abstract
Several hierarchical reinforcement learning methods leverage planning to create a graph or sequences of intermediate goals, guiding a lower-level goal-conditioned (GC) policy to reach some final goals. The low-level policy is typically conditioned on the current goal, with the aim of reaching it as quickly as possible. However, this approach can fail when an intermediate goal can be reached in multiple ways, some of which may make it impossible to continue toward subsequent goals. To address this issue, we introduce two instances of Markov Decision Processes (MDPs) in which the optimization objective favors policies that not only reach the current goal but also subsequent ones. In the first, the agent is conditioned on both the current and final goals, while in the second, it is conditioned on the next two goals in the sequence. We conduct a series of experiments on navigation and pole-balancing tasks in which sequences of intermediate goals are given. By evaluating policies trained with TD3+HER on both the standard GC-MDP and our proposed MDPs, we show that, in most cases, conditioning on the next two goals improves stability and sample efficiency over other approaches.
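
To make the three conditioning schemes concrete, the following minimal sketch builds the input vector a low-level policy might receive under each one. It is an illustration under stated assumptions, not the authors' implementation: the function names, the mode strings (`"gc"`, `"gseq"`, `"2g"`), the flat concatenation of state and goals, and the choice to repeat the final goal when no successor exists are all hypothetical.

```python
from typing import List, Sequence

Goal = Sequence[float]


def goals_for_policy(goal_seq: List[Goal], idx: int, mode: str) -> List[float]:
    """Return the flattened goal part of the policy input.

    goal_seq: ordered intermediate goals, ending with the final goal.
    idx:      index of the goal the agent is currently trying to reach.
    mode:     hypothetical label for the conditioning scheme.
    """
    current = list(goal_seq[idx])
    if mode == "gc":
        # Standard GC-MDP: only the current goal.
        return current
    if mode == "gseq":
        # Current goal plus the final goal of the sequence.
        return current + list(goal_seq[-1])
    if mode == "2g":
        # Next two goals: the current goal and its successor.
        # Assumption: at the end of the sequence, repeat the final goal.
        successor = goal_seq[min(idx + 1, len(goal_seq) - 1)]
        return current + list(successor)
    raise ValueError(f"unknown mode: {mode}")


def policy_input(state: Sequence[float], goal_seq: List[Goal],
                 idx: int, mode: str) -> List[float]:
    """Concatenate the state with the goals selected by the chosen scheme."""
    return list(state) + goals_for_policy(goal_seq, idx, mode)


if __name__ == "__main__":
    goals = [[1.0, 0.0], [2.0, 0.0], [3.0, 1.0]]
    state = [0.0, 0.0]
    for mode in ("gc", "gseq", "2g"):
        print(mode, policy_input(state, goals, 0, mode))
```

Under these assumptions, the two-goal input always has a fixed size (state plus two goals) regardless of sequence length, which is one reason such a scheme is convenient for an off-the-shelf actor-critic learner.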
