Highly Efficient Self-Adaptive Reward Shaping for Reinforcement Learning
Haozhe Ma, Zhengding Luo, Thanh Vinh Vo, Kuankuan Sima, Tze-Yun Leong
TL;DR
This work tackles the sparse-reward challenge in reinforcement learning by introducing Self-Adaptive Success Rate-based Reward Shaping (SASR). SASR generates auxiliary rewards $R^S(s)$ from success rates modeled as Beta distributions: for a state $s_i$, a sample $r^S_i \sim \text{Beta}(N_S(s_i)+1, N_F(s_i)+1)$ is drawn and then mapped by a scaling function $f(\cdot)$ to a usable range. As data accumulates, the samples become less random, so the shaped rewards shift from exploration toward exploitation over time. Rather than training an additional model, SASR derives the Beta parameters using Kernel Density Estimation (KDE) with Random Fourier Features (RFF) over two buffers, one of successes and one of failures, and draws the rewards in a manner inspired by Thompson sampling for robustness. Integrated with Soft Actor-Critic (SAC), SASR improves sample efficiency and convergence stability across extremely sparse, high-dimensional tasks, offering a practical, learning-free mechanism for reward shaping in continuous spaces. The approach demonstrates strong cross-domain performance and provides a foundation for extending reward shaping to denser reward settings and adaptive buffer strategies.
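The sampling step above can be illustrated with a minimal Python sketch. It assumes the counts $N_S$ and $N_F$ are already available as numbers; the function names and the linear form chosen for $f(\cdot)$ are hypothetical and are not taken from the paper.

```python
import numpy as np

def sample_success_rate(n_success, n_failure, rng, size=None):
    """Draw success-rate samples r ~ Beta(N_S + 1, N_F + 1)."""
    return rng.beta(n_success + 1.0, n_failure + 1.0, size=size)

def scale_reward(rate, weight=1.0):
    """Hypothetical f: map a rate in [0, 1] linearly onto [-weight, weight]."""
    return weight * (2.0 * rate - 1.0)

rng = np.random.default_rng(0)
# No evidence yet: Beta(1, 1) is uniform, so samples are highly random.
early = sample_success_rate(0, 0, rng, size=5000)
# After 80 successes and 20 failures the samples concentrate near the success rate.
late = sample_success_rate(80, 20, rng, size=5000)

print("early std:", early.std(), "late std:", late.std())
print("late mean:", late.mean(), "shaped:", scale_reward(late.mean()))
```

The spread of the samples shrinks as counts grow, which is the mechanism by which the shaped reward would move from exploratory noise to a stable signal.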
Abstract
Reward shaping is a technique in reinforcement learning that addresses the sparse-reward problem by providing more frequent and informative rewards. We introduce a self-adaptive and highly efficient reward shaping mechanism that incorporates success rates derived from historical experiences as shaped rewards. The success rates are sampled from Beta distributions, which dynamically evolve from uncertain to reliable values as data accumulates. Initially, the shaped rewards exhibit more randomness to encourage exploration, while over time, the increasing certainty enhances exploitation, naturally balancing exploration and exploitation. Our approach employs Kernel Density Estimation (KDE) combined with Random Fourier Features (RFF) to derive the Beta distributions, providing a computationally efficient, non-parametric, and learning-free solution for high-dimensional continuous state spaces. Our method is validated on various tasks with extremely sparse rewards, demonstrating notable improvements in sample efficiency and convergence stability over relevant baselines.
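To make the KDE-with-RFF idea concrete, the following is a rough sketch rather than the authors' implementation. It assumes a Gaussian kernel, and it assumes that a pseudo-count for a query state is the sum of kernel values against the states in one buffer; the class name, bandwidth, and feature count are illustrative choices. With RFF, the buffer is summarized by a single running feature vector, so a query costs one dot product regardless of buffer size.

```python
import numpy as np

class RFFKernelSum:
    """Approximates sum_i exp(-||x - x_i||^2 / (2 h^2)) with Random Fourier Features."""

    def __init__(self, dim, n_features=2000, bandwidth=0.5, seed=0):
        rng = np.random.default_rng(seed)
        # Frequencies drawn from the Gaussian kernel's spectral density, N(0, 1/h^2).
        self.W = rng.normal(0.0, 1.0 / bandwidth, size=(n_features, dim))
        self.b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
        self.scale = np.sqrt(2.0 / n_features)
        self.feature_sum = np.zeros(n_features)  # running sum of z(x_i) over the buffer

    def features(self, x):
        x = np.atleast_2d(x)
        return self.scale * np.cos(x @ self.W.T + self.b)

    def add(self, states):
        """Fold new buffer states into the running feature sum."""
        self.feature_sum += self.features(states).sum(axis=0)

    def pseudo_count(self, query):
        """z(query) . sum_i z(x_i); clipped because the approximation can dip below zero."""
        return np.maximum(self.features(query) @ self.feature_sum, 0.0)

rng = np.random.default_rng(1)
success_states = rng.normal(0.0, 0.1, size=(50, 2))  # toy "success" buffer near the origin
estimator = RFFKernelSum(dim=2)
estimator.add(success_states)
print("near:", estimator.pseudo_count(np.zeros(2))[0])
print("far:", estimator.pseudo_count(np.array([5.0, 5.0]))[0])
```

Under these assumptions, one such estimator per buffer would yield the two values that parameterize the Beta distribution for a given state.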
