Shaping Sparse Rewards in Reinforcement Learning: A Semi-supervised Approach

Wenyun Li, Wenjie Huang, Chen Sun

TL;DR

The proposed approach performs reward shaping not only by utilizing non-zero-reward transitions but also by employing the Semi-Supervised Learning technique combined with a novel data augmentation to learn trajectory space representations from the majority of transitions, thereby improving the efficacy of reward shaping.

Abstract

In many real-world scenarios, reward signals for agents are exceedingly sparse, making it challenging to learn an effective reward function for reward shaping. To address this issue, the proposed approach performs reward shaping not only by utilizing non-zero-reward transitions but also by employing the Semi-Supervised Learning (SSL) technique combined with a novel data augmentation to learn trajectory space representations from the majority of transitions, i.e., zero-reward transitions, thereby improving the efficacy of reward shaping. Experimental results in Atari and robotic manipulation demonstrate that our method outperforms supervised-based approaches in reward inference, leading to higher agent scores. Notably, in more sparse-reward environments, our method achieves up to twice the peak scores of supervised baselines. The proposed double entropy data augmentation enhances performance, showcasing a 15.8% increase in best score over other augmentation methods.

Paper Structure

This paper contains 12 sections, 9 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: The figure illustrates how the SSRS method leverages both non-zero reward transitions and sparse reward transitions for reward shaping through a semi-supervised learning approach incorporating strong-weak data augmentation. The augmentation techniques include cutout, smooth, and entropy augmentation, among others (see Table 2 in the Appendix for details).
  • Figure 2: (a)-(b) Reward curves of SSRS using C51 as the backbone, compared with the RCP algorithm. (c)-(e) Best score curves of SSRS under different data augmentation methods (using SAC as the backbone), compared with RCP. Note: To avoid cluttered curves from multiple variants, the best score—defined as the maximum score achieved from the beginning up to the current test episode—is plotted in (c)-(e). All curves represent the mean ± standard deviation over 5 seeds. The horizontal axis denotes test steps, and training is conducted for 1000k frames.
  • Figure 3: This figure shows the best score curves of the SSRS variants and the aforementioned baseline in the robotic manipulation environment FetchReach (mean $\pm$ std over 3 seeds). The horizontal axis denotes test steps, and training is conducted for 1000k frames.
  • Figure 4: This figure shows the consensus matrix of the reward distribution in Hero after 1000 episode iterations, which indicates that the distribution of transitions along the reward dimension in the trajectory space exhibits a certain clustering property. The clustering method used is Gaussian Mixture Models, and the number of iterations for the consensus matrix is 100.
  • Figure 5: This figure illustrates the impact of two hyperparameters, the update probability $p_u$ and the confidence threshold $\lambda$, on SSRS (using the Venture environment as an example). The lower x-axis represents the values of the confidence threshold, while the upper x-axis represents the values of the update probability.
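
The "best score" plotted in Figures 2 and 3 is defined as the maximum score achieved from the beginning up to the current test episode, reported as mean ± standard deviation over seeds. A minimal sketch of that computation follows. The function names, the sample scores, and the use of the population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from itertools import accumulate
from statistics import mean, pstdev


def best_score_curve(scores):
    """Running maximum: best score from the first test episode up to each point."""
    return list(accumulate(scores, max))


def aggregate_over_seeds(curves):
    """Per-step mean and std across seeds (assumes curves of equal length).

    Population std (pstdev) is an assumption; the paper does not say which
    estimator it uses.
    """
    means = [mean(step) for step in zip(*curves)]
    stds = [pstdev(step) for step in zip(*curves)]
    return means, stds


# Hypothetical per-episode test scores for two seeds (illustrative numbers only).
seed_scores = [
    [0.0, 3.0, 1.0, 5.0, 2.0],
    [1.0, 1.0, 4.0, 2.0, 6.0],
]
curves = [best_score_curve(s) for s in seed_scores]
means, stds = aggregate_over_seeds(curves)
```

Because the running maximum never decreases, these curves are monotone, which is why they stay readable when several variants are overlaid in one panel.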