A tale of two goals: leveraging sequentiality in multi-goal scenarios

Olivier Serris, Stéphane Doncieux, Olivier Sigaud

TL;DR

The paper tackles the challenge of following sequences of intermediate goals in hierarchical RL when an intermediate goal can be reached in a configuration that hinders progress toward the final goal. It introduces two MDP formulations, $M_{gseq}$ and $M_{2G}$, that condition the low-level goal-conditioned policy on the current and final goals or on the next two goals in the sequence, respectively, and pairs an expert planner with TD3+HER learning. Empirical results on navigation and pole-balancing tasks show that the two-goal formulation, $M_{2G}$, generally achieves greater stability and faster learning than conditioning on the current and final goals, with $M_{gseq}$ remaining a viable but slower alternative because of its broader goal diversity. The findings highlight the importance of horizon-tuned, multi-goal conditioning for reliable goal chaining, while also noting the limitation of relying on a fixed planner and suggesting future work to jointly learn or adapt the planner with the learner.

Abstract

Several hierarchical reinforcement learning methods leverage planning to create a graph or sequences of intermediate goals, guiding a lower-level goal-conditioned (GC) policy to reach some final goals. The low-level policy is typically conditioned on the current goal, with the aim of reaching it as quickly as possible. However, this approach can fail when an intermediate goal can be reached in multiple ways, some of which may make it impossible to continue toward subsequent goals. To address this issue, we introduce two instances of Markov Decision Processes (MDPs) where the optimization objective favors policies that not only reach the current goal but also subsequent ones. In the first, the agent is conditioned on both the current and final goals, while in the second, it is conditioned on the next two goals in the sequence. We conduct a series of experiments on navigation and pole-balancing tasks in which sequences of intermediate goals are given. By evaluating policies trained with TD3+HER on both the standard GC-MDP and our proposed MDPs, we show that, in most cases, conditioning on the next two goals improves stability and sample efficiency over other approaches.
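
The three conditioning schemes contrasted in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the scheme labels (`"gc"`, `"gseq"`, `"2g"`), the flat concatenation of state and goals, and the choice to repeat the last goal when no further goal exists are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Goal = Tuple[float, ...]


def conditioning_goals(goals: Sequence[Goal], i: int, scheme: str) -> List[Goal]:
    """Return the goals the low-level policy sees while pursuing goals[i].

    Hypothetical helper: scheme names are labels chosen for this sketch.
    """
    current = goals[i]
    if scheme == "gc":
        # Standard GC-MDP: only the current goal.
        return [current]
    if scheme == "gseq":
        # Current goal plus the final goal of the sequence.
        return [current, goals[-1]]
    if scheme == "2g":
        # The next two goals in the sequence.
        # Assumption: at the last goal, repeat it to keep the input size fixed.
        following = goals[min(i + 1, len(goals) - 1)]
        return [current, following]
    raise ValueError(f"unknown scheme: {scheme}")


def policy_input(state: Sequence[float], goals: Sequence[Goal], i: int, scheme: str) -> List[float]:
    """Concatenate the state with the conditioning goals into one flat vector."""
    flat = list(state)
    for goal in conditioning_goals(goals, i, scheme):
        flat.extend(goal)
    return flat


if __name__ == "__main__":
    plan = [(1.0, 0.0), (2.0, 0.0), (3.0, 1.0)]  # intermediate goals, final goal last
    for scheme in ("gc", "gseq", "2g"):
        print(scheme, policy_input((0.0, 0.0), plan, 0, scheme))
```

In this sketch, the `"gseq"` input spans every (current, final) pair, whereas `"2g"` only ever pairs adjacent goals.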

Paper Structure

This paper contains 21 sections, 8 equations, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: The environments used in our experiments. In Dubins Hallway, the agent is evaluated on five goals, starting in the main corridor and aiming for each goal shown as a red circle. In GC-Cartpole, the agent is evaluated on the two goals farthest from the center, shown as red rectangles. In PointMaze: Serp3, the agent, shown as a green ball, is evaluated only on the hardest goal.
  • Figure 2: (a) Comparison of success rates for each method. Evaluation is performed 10 times every 2K steps, with results reported as the mean and 95% confidence interval over 10 seeds. (b) Set of trajectories of trained agents on Dubins Hallway. (c) Value function over episodes matching the trajectories in (b). Our sequential approaches outperform the myopic and non-sequential agents. The $M_{gseq}$-td3 agent struggles to propagate value over full episodes.
  • Figure 3: (a) Top: Comparison of success rates for each method. Evaluation is performed 10 times every 2K steps, with results reported as the mean and 95% confidence interval over 10 runs. For all runs, metrics are smoothed using a moving average with a window size of 5 to increase readability. (a) Bottom: Time required to reach the goal state, considering only successful trajectories for computing the mean and confidence interval. (b) A set of trajectories generated by trained agents on PointMaze: Serp3. Again, $M_{2g}$-td3 seems to outperform $M_{gseq}$-td3.
  • Figure 4: Ablation study: Success rate for all previous environments. Evaluation is performed 10 times every 2K steps, with results reported as the mean and 95% confidence interval over 10 runs. For all runs, metrics are smoothed using a moving average with a window size of 5 to increase readability. Top: Ablation study for $M_{2g}$-td3, where the blue and red curves represent the removal of the first and second relabeling mechanisms, respectively. Bottom: Ablation study for $M_{gseq}$-td3, where the blue and red curves correspond to the removal of the first and final goal relabeling mechanisms, respectively.
  • Figure 5: The graph used to implement the expert planner in each environment.
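
The reporting protocol mentioned in the captions (mean and 95% confidence interval across runs, smoothed with a moving average of window size 5) can be sketched as follows. The captions do not say how the interval or the smoothing is computed, so the normal-approximation interval (1.96 times the standard error) and the trailing window below are assumptions, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def moving_average(values: Sequence[float], window: int = 5) -> List[float]:
    """Trailing moving average; early points average over the values seen so far."""
    smoothed = []
    for k in range(len(values)):
        start = max(0, k - window + 1)
        chunk = values[start:k + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def mean_and_ci95(samples: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, half-width) of a 95% interval across runs.

    Assumption: normal approximation, half-width = 1.96 * standard error.
    """
    m = mean(samples)
    if len(samples) < 2:
        return m, 0.0
    half_width = 1.96 * stdev(samples) / math.sqrt(len(samples))
    return m, half_width


if __name__ == "__main__":
    # Made-up success rates for one evaluation point across 10 runs.
    rates = [0.8, 0.9, 1.0, 0.7, 0.9, 1.0, 0.8, 0.9, 1.0, 0.9]
    m, hw = mean_and_ci95(rates)
    print(f"success rate: {m:.2f} +/- {hw:.2f}")
```

With only 10 runs, a Student-t or bootstrap interval would be somewhat wider than this normal approximation.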