Potential-Based Reward Shaping For Intrinsic Motivation

Grant C. Forbes; Nitish Gupta; Leonardo Villalobos-Arias; Colin M. Potts; Arnav Jhala; David L. Roberts

TL;DR

The paper addresses the risk that intrinsic motivation (IM) rewards can alter the set of optimal policies in reinforcement learning. It extends potential-based reward shaping (PBRS) to potentials that depend on arbitrary variables and introduces PBIM, a practical method to convert IM rewards into a potential-based form while preserving optimality, supported by a boundary-condition theorem. The authors provide both a non-normalized and a normalized PBIM variant, with theoretical guarantees and an empirical demonstration on MiniGrid DoorKey and Cliff Walking showing reduced reward hacking and, in some tasks, accelerated training. Normalized PBIM, in particular, can match baseline no-IM performance in challenging sparse-reward settings and offer robust improvements across multiple scenarios.

Abstract

Recently there has been a proliferation of intrinsic motivation (IM) reward-shaping methods to learn in complex and sparse-reward environments. These methods can often inadvertently change the set of optimal policies in an environment, leading to suboptimal behavior. Previous work on mitigating the risks of reward shaping, particularly through potential-based reward shaping (PBRS), has not been applicable to many IM methods, as they are often complex, trainable functions themselves, and therefore dependent on a wider set of variables than the traditional reward functions that PBRS was developed for. We present an extension to PBRS that we prove preserves the set of optimal policies under a more general set of functions than has been previously proven. We also present Potential-Based Intrinsic Motivation (PBIM), a method for converting IM rewards into a potential-based form that is useable without altering the set of optimal policies. Testing in the MiniGrid DoorKey and Cliff Walking environments, we demonstrate that PBIM successfully prevents the agent from converging to a suboptimal policy and can speed up training.

Paper Structure (19 sections, 2 theorems, 27 equations, 7 figures, 1 table)

Introduction
Related Work
Potential-Based Reward Shaping
Intrinsic Motivation
Main Results
Extending Potential-Based Reward Shaping to Functions of Arbitrary Variables
Converting Functions of Arbitrary Variables to Potential-Based Reward Functions
Empirical Demonstration
MiniGrid DoorKey
Discussion
Cliff Walking
Discussion
Longer Cliff Walking
Conclusion
MiniGrid DoorKey Environment
...and 4 more sections

Key Result

Theorem 1

The addition of a shaping reward $F_t = \gamma \Phi_{t+1} - \Phi_t$ leaves the set of optimal policies unchanged if the boundary condition (the paper's Equation boundary_condition) holds.
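
To see why a condition on the potential at the end of an episode is relevant, it helps to check numerically that the discounted sum of the shaping rewards $F_t = \gamma \Phi_{t+1} - \Phi_t$ telescopes to $\gamma^N \Phi_N - \Phi_0$. The following is a minimal sketch, not the authors' code: the function names, the episode length, and the randomly drawn potentials are all assumptions made for illustration, and the boundary condition itself is not reproduced here.

```python
import random


def shaping_rewards(potentials, gamma):
    """Return F_t = gamma * Phi_{t+1} - Phi_t for t = 0..N-1.

    `potentials` is a hypothetical list [Phi_0, ..., Phi_N] observed along
    one episode; the potentials may depend on arbitrary variables.
    """
    return [gamma * potentials[t + 1] - potentials[t]
            for t in range(len(potentials) - 1)]


def discounted_sum(rewards, gamma):
    """Return sum_t gamma^t * rewards[t]."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


gamma = 0.99
rng = random.Random(0)  # fixed seed so the check is reproducible
potentials = [rng.uniform(-1.0, 1.0) for _ in range(11)]  # Phi_0..Phi_10

n_steps = len(potentials) - 1
shaped_return = discounted_sum(shaping_rewards(potentials, gamma), gamma)
telescoped = gamma ** n_steps * potentials[-1] - potentials[0]

# All intermediate potentials cancel; only the first and last remain.
assert abs(shaped_return - telescoped) < 1e-9
```

Because every intermediate potential cancels, the total shaping contribution over an episode depends only on $\Phi_0$ and $\gamma^N \Phi_N$. This suggests, as a reading of the theorem rather than a restatement of it, that a boundary condition needs to constrain only how those remaining terms can vary with the agent's behavior.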

Figures (7)

Figure 1: Panels fig:025-full, frame_results_02, and frame_results_005: frames per episode for each method (lower is better). The shaded region represents standard deviation, and plots are of a 100-point moving average. Panel fig:zoomed: same results as fig:025-full, but zoomed in. All differences in means are significant. For IM + PBRS vs. IM no PBRS in fig:025-full, $T = 36.5, p < 0.01$. For IM + PBRS vs. IM no PBRS in frame_results_02, $T = 27.4, p < 0.01$. In frame_results_005, No IM converges lower than IM + PBRS, which converges lower than IM + PBRS no norm, which converges lower than IM no PBRS. Respectively, for each of these pairings, $T = 4.3, p < 0.01$, $T = 6.1, p < 0.01$, and $T = 1.8, p = 0.32$. While the last of these isn't significant, the difference between IM + PBRS and IM no PBRS is, with $T = 7.9, p < 0.01$.
Figure 2: Average cumulative extrinsic return and episode length for the Cliff Walking environment. Error bars are standard deviations over 10 runs. Differences in means between returns of No IM and RND ($p < 0.05$) and between returns of No IM and PBIM No Norm ($p < 0.05$) are statistically significant. Mean episode lengths of both PBIM Norm and No IM are statistically different from those of both PBIM No Norm and RND ($p < 0.05$ for all).
Figure 3: Final policies of trained agents and their estimated Q-values. Arrows indicate the action with the highest estimated Q-value in each position. A brighter hue indicates a higher Q-value.
Figure 4: Average cumulative return for the large cliff walking environment.
Figure 5: An example MiniGrid DoorKey 8x8 environment.
...and 2 more figures

Theorems & Definitions (2)

Theorem 1: Sufficient Condition For Optimality
Theorem 2: PBIM Preserves Optimality
