ROER: Regularized Optimal Experience Replay

Changling Li, Zhang-Wei Hong, Pulkit Agrawal, Divyansh Garg, Joni Pajarinen

TL;DR

This work uses the KL divergence as the regularizer to derive a new prioritization scheme, regularized optimal experience replay (ROER), and evaluates it with the Soft Actor-Critic algorithm on continuous-control MuJoCo and DM Control benchmark tasks.

Abstract

Experience replay is a key component in the success of online reinforcement learning (RL). Prioritized experience replay (PER) reweights experiences by their temporal difference (TD) error, which empirically improves performance. However, few works have explored the motivation for using the TD error. In this work, we provide an alternative perspective on TD-error-based reweighting. We show the connection between experience prioritization and occupancy optimization. By using a regularized RL objective with an $f$-divergence regularizer and employing its dual form, we show that an optimal solution to the objective is obtained by shifting the distribution of off-policy data in the replay buffer towards the on-policy optimal distribution using TD-error-based occupancy ratios. Our derivation results in a new pipeline for TD-error prioritization. We specifically explore the KL divergence as the regularizer and obtain a new form of prioritization scheme, regularized optimal experience replay (ROER). We evaluate the proposed scheme with the Soft Actor-Critic (SAC) algorithm on continuous-control MuJoCo and DM Control benchmark tasks, where it outperforms baselines in 6 out of 11 tasks, while results on the remaining tasks match or stay close to the baselines. Further, with pretraining, ROER achieves noticeable improvement on the difficult Antmaze environment, where baselines fail, showing applicability to offline-to-online fine-tuning. Code is available at https://github.com/XavierChanglingLi/Regularized-Optimal-Experience-Replay.
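
To make the contrast in the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes that PER samples in proportion to $(|\delta| + \epsilon)^\alpha$ and that, under a KL regularizer, the TD-error-based occupancy ratio is proportional to $\exp(\delta / \beta)$ for TD error $\delta$ and temperature $\beta$. The function names, the default values, and the `max_ratio` clip (added only for numerical stability) are hypothetical.

```python
import numpy as np

def per_probs(td_errors, alpha=0.6, eps=1e-6):
    """PER-style sampling probabilities from absolute TD errors."""
    p = (np.abs(np.asarray(td_errors, dtype=np.float64)) + eps) ** alpha
    return p / p.sum()

def kl_ratio_probs(td_errors, beta=1.0, max_ratio=50.0):
    """Assumed KL-regularized form: weights proportional to exp(TD / beta).

    max_ratio is a hypothetical clip that keeps the exponential bounded.
    """
    td = np.asarray(td_errors, dtype=np.float64)
    w = np.minimum(np.exp(td / beta), max_ratio)
    return w / w.sum()

# Toy TD errors, sorted from most negative to most positive.
td = np.array([-2.0, 0.0, 0.5, 3.0])
per = per_probs(td)
kl = kl_ratio_probs(td)
```

In this toy batch, PER gives the transition with TD error -2.0 more weight than the one with TD error 0.0, because it uses the magnitude only. The exponential weighting is sign-sensitive, so it ranks the four transitions in the same order as their signed TD errors.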

Paper Structure (21 sections, 36 equations, 8 figures, 6 tables, 1 algorithm)

This paper contains 21 sections, 36 equations, 8 figures, 6 tables, 1 algorithm.

Figures (8)

  • Figure 1: Underestimation bias in the value estimates of SAC, SAC with PER, and SAC with ROER on continuous control tasks in MuJoCo, measured as the difference between the true values and the value estimates. True values are obtained from Monte Carlo returns. Value estimates and true values are averaged over 20 random seeds, and the error bars represent 95% confidence intervals.
  • Figure 2: Learning curves for the Antmaze tasks in Gym-Robotics with data from D4RL. Curves are averaged over 10 random seeds, where the shaded area represents the standard error of the average evaluation.
  • Figure 3: Ablation of convergence rate ($\lambda$), Gumbel loss clip (Grad Clip), loss temperature ($\beta$), maximum exponential clip (Max Exp Clip), and minimum priority clip (Min Clip) for HalfCheetah-v2 over 5 random seeds. One parameter is varied while the rest are held fixed. The default combination is [0.01, 7, 4, 50, 10], which is the set used in our final results. All curves are smoothed with a Savitzky–Golay filter for visual clarity. The shaded region represents the standard error, which is preferred here to keep the curves visually separable.
  • Figure 4: Ablation of convergence rate ($\lambda$), Gumbel loss clip (Grad Clip), loss temperature ($\beta$), maximum exponential clip (Max Exp Clip), and minimum priority clip (Min Clip) for Hopper-stand over 5 random seeds. One parameter is varied while the rest are held fixed. The default combination is [0.01, 7, 1, 10, 1], which is the set used in our final results. All curves are smoothed with a Savitzky–Golay filter for visual clarity. The shaded region represents the standard error, which is preferred here to keep the curves visually separable.
  • Figure 5: Ablation of Gumbel loss clip (Grad Clip), loss temperature ($\beta$), and minimum priority clip (Min Clip) for Antmaze with the Antmaze-umaze-diverse-v2 dataset over 5 random seeds. One parameter is varied while the rest are held fixed. The default combination is [0.01, 7, 0.4, 50, 1], which is the set used in our final results. The shaded region represents the standard error, which is preferred here to keep the curves visually separable.
  • ...and 3 more figures
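
The bias measurement described in the Figure 1 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's evaluation code: the true value is taken as the discounted Monte Carlo return of a single rollout, any SAC entropy term is ignored, and the sign convention (estimate minus return, so negative means underestimation) is chosen here for illustration. The function names are hypothetical.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Monte Carlo return G_t = r_t + gamma * G_{t+1} for each step of one episode."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def estimation_bias(value_estimates, rewards, gamma=0.99):
    """Mean of (estimate - Monte Carlo return); negative means underestimation."""
    mc = discounted_returns(rewards, gamma)
    return float(np.mean(np.asarray(value_estimates, dtype=np.float64) - mc))

# Toy episode: three steps with reward 1 and a critic that always predicts 1.
toy_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
toy_bias = estimation_bias([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], gamma=0.5)
```

In practice the comparison would be averaged over many states and seeds, as the caption describes.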