Improving the Effectiveness of Potential-Based Reward Shaping in Reinforcement Learning

Henrik Müller, Daniel Kudenko

TL;DR

This work analyzes potential-based reward shaping (PBRS) in reinforcement learning, revealing that PBRS effectiveness is strongly influenced by the initial Q-values $Q_{init}$ and external rewards. It introduces a constant bias shift $b$ to the potential function, yielding a shifted potential $\Phi_b(s)$ that can improve sample efficiency without altering policy preferences, and derives bounds and limitations for potential scale, particularly in terminal states. The authors show that continuous potentials have intrinsic limitations for correct shaping, and advocate exponential potentials to preserve incentives, supported by theoretical analysis and experiments in Gridworld, CartPole, and MountainCar, including both tabular and deep RL settings. Practically, the results offer actionable guidance on adjusting PBRS to leverage prior knowledge while maintaining policy invariance, with implications for faster learning in sparse-reward tasks and deep RL applications.

Abstract

Potential-based reward shaping is commonly used to incorporate prior knowledge of how to solve the task into reinforcement learning because it can formally guarantee policy invariance. As such, the optimal policy and the ordering of policies by their returns are not altered by potential-based reward shaping. In this work, we highlight the dependence of effective potential-based reward shaping on the initial Q-values and external rewards, which determine the agent's ability to exploit the shaping rewards to guide its exploration and achieve increased sample efficiency. We formally derive how a simple linear shift of the potential function can be used to improve the effectiveness of reward shaping without changing the encoded preferences in the potential function, and without having to adjust the initial Q-values, which can be challenging and undesirable in deep reinforcement learning. We show the theoretical limitations of continuous potential functions for correctly assigning positive and negative reward shaping values. We verify our theoretical findings empirically on Gridworld domains with sparse and uninformative reward functions, as well as on the Cart Pole and Mountain Car environments, where we demonstrate the application of our results in deep reinforcement learning.
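
The effect of a constant shift on the shaping reward can be illustrated with a minimal sketch. It assumes the standard PBRS form $F(s, s') = \gamma\Phi(s') - \Phi(s)$ and an additive shift $\Phi_b(s) = \Phi(s) + b$; the function names are hypothetical, and the handling of terminal states discussed in the paper is deliberately left out.

```python
def shaping_reward(phi_s: float, phi_next: float, gamma: float, b: float = 0.0) -> float:
    """Shaping reward for one transition s -> s' under a shifted potential.

    Assumes F_b(s, s') = gamma * (Phi(s') + b) - (Phi(s) + b).
    With b = 0 this is the unshifted PBRS term gamma * Phi(s') - Phi(s).
    """
    return gamma * (phi_next + b) - (phi_s + b)


def shift_offset(gamma: float, b: float) -> float:
    """Per-step offset that the shift adds to every shaping reward: (gamma - 1) * b."""
    return (gamma - 1.0) * b


if __name__ == "__main__":
    gamma = 0.75
    phi_s, phi_next = 1.0, 2.0  # illustrative potential values, not from the paper
    for b in (0.0, -4.0, 4.0):
        f_b = shaping_reward(phi_s, phi_next, gamma, b)
        print(f"b={b:+.1f}  F_b={f_b:+.3f}  offset={shift_offset(gamma, b):+.3f}")
```

Under these assumptions the shift changes every shaping reward by the same amount $(\gamma - 1)b$, so for $\gamma < 1$ a negative $b$ raises all shaping rewards and a positive $b$ lowers them, while differences between potentials, and hence the encoded preferences, stay the same.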

Paper Structure

This paper contains 20 sections, 17 equations, 4 figures.

Figures (4)

  • Figure 1: Average length of evaluation runs (with $\epsilon=0.05$) on a 25x25 Gridworld with potential-based reward shaping where $\Phi(s) = V^*(s)$.
  • Figure 2: Plots of the reward shaping $F$ that is added to the reward, as a function of the difference $\delta$ between the next potential and the previous potential, with the hue representing the value of the previous potential. All plots are for $\gamma = 0.75$. For the plots of exponential PBRS, the difference $\delta$ is defined as the difference in the original (linear) potential functions.
  • Figure 3: Gridworld results for the two reward functions, goal-directed and on-step. The top row shows the results for the goal-directed reward function; the bottom row shows the results for the on-step negative rewards. Each plot shows the length of the evaluation episodes for different values of the shifting bias $b$. Each curve shows the mean length of the ten evaluation runs (fixed exploration rate $\epsilon=0.05$) that were run every 250 training steps, averaged over five separate training runs. Shaded areas show the standard error of the mean.
  • Figure 4: Results of the Cart Pole and Mountain Car experiments for different values of the bias parameter $b$. Each plot shows the length of the evaluation episodes, which were run every 500 steps and averaged over five evaluation runs per training run. The curves plot the mean of ten separate training runs, and the shaded area is the standard error of the mean.
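
The relationship plotted in Figure 2 for the linear case can be written out as a short worked derivation. This is a sketch that assumes the standard PBRS definition $F(s, s') = \gamma\Phi(s') - \Phi(s)$ and writes $\delta = \Phi(s') - \Phi(s)$; it does not cover the exponential variant, whose exact form is not given in this summary.

```latex
\begin{align*}
F(s, s') &= \gamma\,\Phi(s') - \Phi(s) \\
         &= \gamma\,\bigl(\Phi(s) + \delta\bigr) - \Phi(s) \\
         &= \gamma\,\delta - (1 - \gamma)\,\Phi(s).
\end{align*}
% With \gamma = 0.75 (the value used in Figure 2):
%   F = 0.75\,\delta - 0.25\,\Phi(s),
% so F > 0 only when \delta > \Phi(s) / 3.
```

Under this assumption, the sign of $F$ depends on the previous potential as well as on $\delta$: a step that increases the potential ($\delta > 0$) still receives a negative shaping reward whenever $\Phi(s) > 3\delta$.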