Highly Efficient Self-Adaptive Reward Shaping for Reinforcement Learning

Haozhe Ma; Zhengding Luo; Thanh Vinh Vo; Kuankuan Sima; Tze-Yun Leong

TL;DR

This work tackles the sparse-reward challenge in reinforcement learning by introducing Self-Adaptive Success Rate-based Reward Shaping (SASR). SASR generates auxiliary rewards $R^S(s)$ from success rates modeled as Beta distributions: for each state $s_i$ it samples $r^S_i \sim \text{Beta}(N_S(s_i)+1, N_F(s_i)+1)$, where $N_S(s_i)$ and $N_F(s_i)$ are the success and failure counts for that state, and then maps the sample through a scaling function $f(\cdot)$ to a usable range. Because the Beta distributions narrow as data accumulates, the shaped rewards shift from exploration toward exploitation over time. To avoid training an additional model, SASR derives the Beta parameters using Kernel Density Estimation (KDE) with Random Fourier Features (RFF), maintaining two buffers, one of successes and one of failures, and drawing the shaped rewards in a manner inspired by Thompson sampling. Integrated with Soft Actor-Critic (SAC), SASR improves sample efficiency and convergence stability on extremely sparse, high-dimensional tasks, offering a practical, learning-free mechanism for reward shaping in continuous spaces. The approach performs well across task domains and provides a foundation for extending reward shaping to denser reward settings and adaptive buffer strategies.
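
The sampling step described above can be illustrated with a minimal sketch. It assumes the counts $N_S$ and $N_F$ are already available and stands in a plain multiplication for the scaling function $f(\cdot)$, whose exact form is not given in this summary; the function name `sample_shaped_reward` and the `scale` parameter are hypothetical.

```python
import numpy as np

def sample_shaped_reward(n_success, n_failure, rng, scale=1.0):
    """Draw a success rate from Beta(N_S + 1, N_F + 1) and scale it.

    `scale` is a stand-in for f(.); treating f as a simple multiplication
    is an assumption made for illustration only.
    """
    rate = rng.beta(n_success + 1.0, n_failure + 1.0)
    return scale * rate

rng = np.random.default_rng(0)

# Few observations: the Beta distribution is wide, so samples vary a lot.
early = np.array([sample_shaped_reward(1, 1, rng) for _ in range(5000)])

# Many observations with the same 50% success ratio: samples concentrate.
late = np.array([sample_shaped_reward(100, 100, rng) for _ in range(5000)])

print("early spread:", early.std())
print("late spread: ", late.std())
```

Both settings have the same underlying success ratio, so the samples share a mean near 0.5; only the spread changes, which is the property that lets the shaped reward act as exploration noise early and as a stable signal later.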

Abstract

Reward shaping is a technique in reinforcement learning that addresses the sparse-reward problem by providing more frequent and informative rewards. We introduce a self-adaptive and highly efficient reward shaping mechanism that incorporates success rates derived from historical experiences as shaped rewards. The success rates are sampled from Beta distributions, which dynamically evolve from uncertain to reliable values as data accumulates. Initially, the shaped rewards exhibit more randomness to encourage exploration, while over time, the increasing certainty enhances exploitation, naturally balancing exploration and exploitation. Our approach employs Kernel Density Estimation (KDE) combined with Random Fourier Features (RFF) to derive the Beta distributions, providing a computationally efficient, non-parametric, and learning-free solution for high-dimensional continuous state spaces. Our method is validated on various tasks with extremely sparse rewards, demonstrating notable improvements in sample efficiency and convergence stability over relevant baselines.
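
One way to picture the KDE-with-RFF step is the sketch below. It approximates a Gaussian kernel with random cosine features so that a kernel sum over a buffer reduces to a single dot product with a running feature sum. The class name `RFFCounter`, the bandwidth, the feature count, and the use of the raw kernel sum as a pseudo-count are all assumptions for illustration; the paper's exact conversion from density estimates to Beta parameters may differ.

```python
import numpy as np

class RFFCounter:
    """Approximate Gaussian-kernel pseudo-counts with Random Fourier Features."""

    def __init__(self, dim, n_features=2000, bandwidth=0.5, seed=0):
        rng = np.random.default_rng(seed)
        # Frequencies drawn from N(0, 1/bandwidth^2) correspond to the kernel
        # exp(-||x - y||^2 / (2 * bandwidth^2)).
        self.W = rng.normal(0.0, 1.0 / bandwidth, size=(n_features, dim))
        self.b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
        self.n_features = n_features
        self.feature_sum = np.zeros(n_features)

    def features(self, states):
        states = np.atleast_2d(states)
        proj = states @ self.W.T + self.b
        return np.sqrt(2.0 / self.n_features) * np.cos(proj)

    def add(self, states):
        # Only the running sum is stored, so a query costs O(n_features)
        # regardless of how many states were added.
        self.feature_sum += self.features(states).sum(axis=0)

    def pseudo_count(self, state):
        est = float(self.features(state)[0] @ self.feature_sum)
        return max(est, 0.0)  # the approximation can dip slightly below zero

data_rng = np.random.default_rng(1)
success_states = data_rng.normal(1.0, 0.1, size=(200, 2))
failure_states = data_rng.normal(-1.0, 0.1, size=(200, 2))

success_buffer = RFFCounter(dim=2, seed=0)
failure_buffer = RFFCounter(dim=2, seed=0)
success_buffer.add(success_states)
failure_buffer.add(failure_states)

query = np.array([1.0, 1.0])  # a state close to past successes
n_s = success_buffer.pseudo_count(query)
n_f = failure_buffer.pseudo_count(query)
rate = data_rng.beta(n_s + 1.0, n_f + 1.0)
print("pseudo-counts:", n_s, n_f, "sampled success rate:", rate)
```

Nothing here is trained: the random frequencies are fixed at construction and the buffers are summarized by running sums, which is consistent with the non-parametric, learning-free description in the abstract.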

Paper Structure (27 sections, 20 equations, 7 figures, 14 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Methodology
Self-Adaptive Success Rate Sampling
Highly Efficient Beta Distribution Derivation
Implementation Details
Time and Space Complexity of SASR
The SASR Mechanism for RL agents
Experiments
Comparison and Discussion
Effect of Self-Adaptive Success Rate Sampling
Ablation Study
Conclusion and Discussion
Appendix
...and 12 more sections

Figures (7)

Figure 1: A schematic diagram of the self-adaptive success rate based reward shaping mechanism. KDE: Kernel Density Estimation; RFF: Random Fourier Features.
Figure 2: MuJoCo, robotic, Atari game, and physical simulation tasks in our experiments. Detailed descriptions and the environmental reward models of each task are provided in the Appendix.
Figure 3: The learning performance of SASR compared with the baselines.
Figure 4: Distributions of the shaped rewards over the height of the ant robot in the AntStand task at different training stages. Red diamonds represent the estimated success rate, while the blue polylines show the actual shaped rewards sampled from the Beta distribution.
Figure 5: The density of visited states in the MountainCar task for four training periods.
...and 2 more figures
