HiER: Highlight Experience Replay for Boosting Off-Policy Reinforcement Learning Agents

Dániel Horváth, Jesús Bujalance Martín, Ferenc Gábor Erdős, Zoltán Istenes, Fabien Moutarde

TL;DR

This work tackles the difficulty of training off-policy reinforcement learning agents for robotics in continuous, high-dimensional, and sparse-reward environments without demonstrations. It introduces HiER, which adds a secondary highlight replay buffer to store and emphasize the most relevant experiences, and HiER+, which integrates a data-collection curriculum method (E2H-ISE) to further boost learning. Empirical results across 8 tasks on Panda-Gym, Fetch, and PointMaze benchmarks show that HiER and HiER+ consistently outperform strong baselines and even state-of-the-art variants, reducing the likelihood of getting stuck in local minima and enabling more reliable task success. The proposed approach provides a versatile, generalizable improvement to off-policy RL in robotics, with potential for broader applicability and future exploration of more sophisticated curriculum strategies.

Abstract

Even though reinforcement-learning-based algorithms have achieved superhuman performance in many domains, the field of robotics poses significant challenges, as the state and action spaces are continuous and the reward function is predominantly sparse. Furthermore, in many cases, the agent has no access to any form of demonstration. Inspired by human learning, in this work we propose a method named highlight experience replay (HiER) that creates a secondary highlight replay buffer for the most relevant experiences. For the weight updates, transitions are sampled from both the standard and the highlight experience replay buffers. HiER can be applied with or without the techniques of hindsight experience replay (HER) and prioritized experience replay (PER). Our method significantly improves on the state of the art, as validated on 8 tasks of three robotic benchmarks. Furthermore, to exploit the full potential of HiER, we propose HiER+, in which HiER is enhanced with an arbitrary data collection curriculum learning method. Our implementation, the qualitative results, and a video presentation are available on the project site: http://www.danielhorvath.eu/hier/.
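
The two-buffer idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `HighlightReplay`, the use of a fixed return threshold as the $\lambda$ condition, and the rounding of the per-batch split are assumptions made here for illustration. The symbols `lam` and `xi` follow $\lambda$ and $\xi$ in the caption of Figure 1.

```python
import random
from collections import deque


class HighlightReplay:
    """Hypothetical two-buffer replay: a standard buffer plus a highlight buffer."""

    def __init__(self, capacity, lam, xi, seed=0):
        self.standard = deque(maxlen=capacity)   # plays the role of B_ser
        self.highlight = deque(maxlen=capacity)  # plays the role of B_hier
        self.lam = lam  # ASSUMPTION: fixed threshold on the episode return
        self.xi = xi    # fraction of each batch drawn from the highlight buffer
        self.rng = random.Random(seed)

    def store_episode(self, transitions):
        # transitions: list of (state, action, reward, next_state, done) tuples
        self.standard.extend(transitions)
        episode_return = sum(t[2] for t in transitions)
        # ASSUMPTION: the lambda condition is "undiscounted return exceeds lam"
        if episode_return > self.lam:
            self.highlight.extend(transitions)

    def sample(self, batch_size):
        # Fall back to the standard buffer while the highlight buffer is empty.
        n_high = int(round(self.xi * batch_size)) if self.highlight else 0
        n_std = batch_size - n_high
        # Sampling is with replacement; the standard buffer must be non-empty.
        batch = self.rng.choices(self.highlight, k=n_high) if n_high else []
        batch += self.rng.choices(self.standard, k=n_std)
        return batch


buf = HighlightReplay(capacity=1000, lam=-5.0, xi=0.5)
buf.store_episode([(0, 0, -1.0, 1, False)] * 10)  # return -10: standard only
buf.store_episode([(1, 0, -1.0, 2, False)] * 3)   # return -3: also highlighted
batch = buf.sample(8)  # first 4 items come from the highlight buffer
```

A fixed threshold is only the simplest possible choice for the highlight criterion; the sketch keeps it as a single comparison so that other criteria can be swapped in without touching the sampling logic.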

Paper Structure (24 sections, 1 equation, 17 figures, 9 tables, 1 algorithm)

Figures (17)

  • Figure 1: The overview of HiER and HiER+. For every episode, the initial state is sampled from $\mu_0$. After every episode, the transitions are stored in $\mathcal{B}_{ser}$ and, if the $\lambda$ condition is fulfilled, in $\mathcal{B}_{hier}$ as well. For training, the transitions are sampled from both $\mathcal{B}_{ser}$ and $\mathcal{B}_{hier}$ according to the ratio $\xi$. For a detailed description, see Alg. \ref{alg:HiER+}.
  • Figure 2: Visualization of the effect of parameter $c$ on $\mu_0$ in a 2D case where state $s = [s_x,s_y]$. The initial state $s_0 = [s_{0,x},s_{0,y}]$ is sampled from the probability distribution $\mu_0(c)$.
  • Figure 3: HiER compared to the state-of-the-art across all tasks with 95% CIs. Both HiER versions outperform their corresponding baselines. HiER [HER] yields the best performance in all metrics. The point estimates are presented in Tab. \ref{tab:results_agg_alltasks}.
  • Figure 4: Performance profiles across all tasks with 95% CIs. Left: run-score distribution, right: average-score distribution. The red-dotted line shows the median values while the areas under the performance profiles correspond to the mean values (comparing with Tab. \ref{tab:results_agg_alltasks}, the average-score distribution needs to be examined). Both HiER and HiER [HER] have stochastic dominance over their corresponding baselines.
  • Figure 5: Probability of improvement of HiER versions compared to their corresponding baselines and themselves across all tasks with 95% CIs. The average probabilities from top to bottom are the following: 0.76, 0.88, and 0.85.
  • ...and 12 more figures
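
The "probability of improvement" reported in Figure 5 can be illustrated with a short sketch. It assumes the Mann-Whitney-style definition commonly used in RL evaluation: for each task, the fraction of run pairs in which algorithm X scores higher than algorithm Y (ties counted as one half), averaged over tasks. The captions do not state which definition the paper uses, so this is an assumption; the function name and the toy scores below are hypothetical and are not results from the paper.

```python
def probability_of_improvement(scores_x, scores_y):
    """Average over tasks of P(X > Y), counting ties as 0.5.

    scores_x, scores_y: dicts mapping a task name to a list of per-run scores.
    """
    per_task = []
    for task in scores_x:
        xs, ys = scores_x[task], scores_y[task]
        # Compare every run of X against every run of Y on the same task.
        wins = sum(
            1.0 if x > y else 0.5 if x == y else 0.0
            for x in xs
            for y in ys
        )
        per_task.append(wins / (len(xs) * len(ys)))
    return sum(per_task) / len(per_task)


# Hypothetical toy scores, two runs per task (NOT data from the paper).
toy_x = {"task_a": [1.0, 1.0], "task_b": [0.5, 0.9]}
toy_y = {"task_a": [0.0, 1.0], "task_b": [0.6, 0.6]}
p = probability_of_improvement(toy_x, toy_y)
```

A value above 0.5 means X is more likely than not to beat Y on a randomly chosen task and run pair. The 95% CIs mentioned in the caption would require a resampling step that this sketch omits.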