BAMDP Shaping: a Unified Framework for Intrinsic Motivation and Reward Shaping

Aly Lidayan; Michael Dennis; Stuart Russell

TL;DR

By formulating RL as a BAMDP, the paper unifies intrinsic motivation and reward shaping under a principled framework and identifies the Value of Information ($VOI$) and Value of Opportunity ($VOO$) as core components of BAMDP value. It then proposes BAMDP Potential-Based Shaping Functions (BAMPFs) that take the form $F(h_t)=\gamma\phi(h_t)-\phi(h_{t-1})$ to guide exploration without altering the underlying optimal policy, and proves a BAMDP PBS Theorem establishing Bayes-optimality preservation in both meta-RL and RL when using BAMPFs. Theoretical results are complemented by experiments on Bernoulli Bandits and Mountain Car, plus a Curiosity case study showing how many pseudo-rewards can be retrofitted as BAMPFs to resist reward hacking. The framework provides practical, retrofittable guidelines for designing intrinsic motivation and shaping terms that improve exploration while avoiding degenerate behaviors across diverse RL settings.
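
The shaping form $F(h_t)=\gamma\phi(h_t)-\phi(h_{t-1})$ can be illustrated with a minimal Python sketch. It assumes the history $h_t$ is a list of `(action, reward)` records and uses a made-up potential, `rewarding_steps`, that counts the rewarding steps seen so far; both the history representation and this potential are hypothetical choices for illustration, not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple

Step = Tuple[int, float]   # hypothetical record: (action, reward)
History = Sequence[Step]


def bampf_rewards(history: History,
                  phi: Callable[[History], float],
                  gamma: float) -> List[float]:
    """Return F(h_t) = gamma * phi(h_t) - phi(h_{t-1}) for t = 1..len(history)."""
    shaped = []
    prev = phi(history[:0])          # potential of the empty history h_0
    for t in range(1, len(history) + 1):
        cur = phi(history[:t])       # potential of the history after t steps
        shaped.append(gamma * cur - prev)
        prev = cur
    return shaped


def rewarding_steps(history: History) -> float:
    """Hypothetical potential: number of steps so far with positive reward."""
    return float(sum(1 for _, reward in history if reward > 0))


history = [(0, 0.0), (1, 1.0), (1, 0.0), (2, 1.0)]
shaped = bampf_rewards(history, rewarding_steps, gamma=0.5)
print(shaped)
```

Because the potential is evaluated on the whole history rather than on the current MDP state alone, the same physical state can receive different pseudo-rewards at different points in learning.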

Abstract

Intrinsic motivation and reward shaping guide reinforcement learning (RL) agents by adding pseudo-rewards, which can lead to useful emergent behaviors. However, they can also encourage counterproductive exploits, e.g., fixation with noisy TV screens. Here we provide a theoretical model which anticipates these behaviors, and provides broad criteria under which adverse effects can be bounded. We characterize all pseudo-rewards as reward shaping in Bayes-Adaptive Markov Decision Processes (BAMDPs), which formulates the problem of learning in MDPs as an MDP over the agent's knowledge. Optimal exploration maximizes BAMDP state value, which we decompose into the value of the information gathered and the prior value of the physical state. Pseudo-rewards guide RL agents by rewarding behavior that increases these value components, while they hinder exploration when they align poorly with the actual value. We extend potential-based shaping theory to prove BAMDP Potential-Based Shaping Functions (BAMPFs) are immune to reward-hacking (convergence to behaviors maximizing composite rewards to the detriment of real rewards) in meta-RL, and show empirically how a BAMPF helps a meta-RL agent learn optimal RL algorithms for a Bernoulli Bandit domain. We finally prove that BAMPFs with bounded monotone increasing potentials also resist reward-hacking in the regular RL setting. We show that it is straightforward to retrofit or design new pseudo-reward terms in this form, and provide an empirical demonstration in the Mountain Car environment.
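
A short worked step may help show why the potential-based form limits reward hacking. Summing the discounted shaping terms $F(h_t)=\gamma\phi(h_t)-\phi(h_{t-1})$ over a horizon of $T$ steps gives a telescoping sum. The horizon $T$ and the initial history $h_0$ are notation assumed here for illustration, and this is the standard telescoping argument from potential-based shaping rather than the paper's full proof:

$$\sum_{t=1}^{T}\gamma^{t-1}F(h_t)=\sum_{t=1}^{T}\left(\gamma^{t}\phi(h_t)-\gamma^{t-1}\phi(h_{t-1})\right)=\gamma^{T}\phi(h_T)-\phi(h_0)$$

The total pseudo-reward therefore depends only on the first and last potentials, not on the path taken between them. When $\phi$ is bounded and $\gamma<1$, the term $\gamma^{T}\phi(h_T)$ vanishes as $T$ grows, so the agent cannot accumulate unbounded pseudo-reward by looping.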

Paper Structure (45 sections, 12 theorems, 54 equations, 9 figures, 1 table)

Introduction
Background
Markov Decision Processes
Intrinsic Motivation and Reward Shaping
Formulation of RL Problems as BAMDPs
Pseudo-rewards Correct BAMDP Value Misestimation
The Relationship Between Value Misestimation and Regret
BAMDP Value Decomposition
Preserving Optimality with BAMDP Potential-Based Shaping
Definition of BAMDP Potential-Based Shaping Functions
BAMPFs Preserve Optimality in Meta-RL
BAMDP Potential-Based Shaping Theorem
Experiment: Shaping Meta-RL on Bernoulli Bandits
Preserving Optimality in RL with BAMPFs
Experiment: Shaping RL in Mountain Car
...and 30 more sections

Key Result

Lemma 3.1

The Bayesian regret (Ghavamzadeh et al., 2015) of algorithms acting on the estimate $\hat{\bar{Q}}(\bar{s}_t,a)$ can be expressed as:

Figures (9)

Figure 1: Example potential-based reward shaping functions $\gamma\phi'-\phi$ for potentials over the MDP state (PBSF) and the BAMDP state (BAMPF). Start and goal MDP states $s_0,g$ are marked white and green; other states are colored by the pseudo-reward for transitioning there directly from $s_0$ at the start of the episode. Each episode ends and $s$ is reset to $s_0$ every 50 steps, $\gamma=0.99$.
Figure 2: The caterpillar domain formulated as a BAMDP. Left: prior $p(M)$ is a categorical distribution over MDPs $M_1$ and $M_2$; in both, all transitions are deterministic, with the curved and straight arrows corresponding to eat and go actions respectively. The caterpillar hatches at state $s_w$, and must decide whether to eat for guaranteed reward $21$, or incur $-5$ reward to go to $s_b$. Right: truncated BAMDP transition diagram, arrows are labeled with rewards (and transition probabilities if $p<1$). The stochastic transitions (from the highlighted eat action) are due to the uncertainty over the MDP; all future transitions (the highlighted arrows) become deterministic once its identity is revealed.
Figure 3: The effect of reward shaping on A2C meta-learning an RNN-based RL agent for Bernoulli Bandits with two arms, reward probabilities $(0.1,0.9)$ and a budget of 10 pulls. The mean and standard error of 10 seeds are plotted for each condition. Without shaping, the meta-learner gradually learns to generate RL agents that try fewer arms on average, avoiding over-exploration (grey curve). The 1st Winner Pulls BAMPF (green) sets $\phi$ to the pull count of the first arm that yielded a reward, helping A2C learn to exploit and achieve lower regret more quickly, while still converging to the optimal strategy. However, when this pull count is used directly as a pseudo-reward (1st Winner Pulls, purple), it causes the meta-learner to converge on an agent that over-exploits.
Figure 4: The effect of a bounded monotone BAMPF and entropy bonus pseudo-rewards on DQN in a 1-state MDP, where 1 in 100 total levers gives reward 10 when pulled. None refers to DQN without any pseudo-rewards. The BAMPF potential is the count of unique levers tried, and the Entropy reward is 10x the entropy of the last 10 lever pulls. The setting is non-episodic with $\gamma=0.9$. The mean and standard error of 32 seeds are plotted for each condition. See the paper's appendix for full details.
Figure 5: The effect of pseudo-rewards on PPO in Mountain Car; the mean and standard error of 10 seeds are plotted for each type of reward shaping. Displacement rewards the current $x$ displacement of the car, Displacement PBS is potential-based shaping with the current displacement as the MDP state potential $\phi(s)$, and Max Displacement BAMPF uses the exponentially smoothed maximum displacement over training as the BAMDP state potential $\phi(h)$. With Displacement the agent learns a reward-hacking policy that avoids the goal to collect more pseudo-rewards, while the BAMPF helps PPO learn to reach the goal more quickly while preserving optimality.
...and 4 more figures
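
The two history-based potentials named in the captions of Figures 3 and 4 can be sketched in Python as follows. This is a rough illustration, not the authors' implementation: the history is assumed to be a list of `(arm, reward)` pairs, the "1st Winner Pulls" count is assumed to include every pull of the winning arm (also those before its first reward), and the function names and the helper `shaping_term` are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

Pull = Tuple[int, float]   # hypothetical record: (arm or lever index, reward)
History = Sequence[Pull]


def first_winner_pulls(history: History) -> float:
    """Pull count of the first arm that yielded a reward (0 if none has yet)."""
    winner: Optional[int] = None
    for arm, reward in history:
        if reward > 0:
            winner = arm
            break
    if winner is None:
        return 0.0
    # Assumption: count all pulls of the winning arm in the history.
    return float(sum(1 for arm, _ in history if arm == winner))


def unique_levers_tried(history: History) -> float:
    """Count of distinct levers pulled so far; never decreases and is
    bounded above by the total number of levers."""
    return float(len({arm for arm, _ in history}))


def shaping_term(phi: Callable[[History], float],
                 prev_history: History,
                 history: History,
                 gamma: float) -> float:
    """BAMPF pseudo-reward gamma * phi(h_t) - phi(h_{t-1})."""
    return gamma * phi(history) - phi(prev_history)


pulls = [(0, 0.0), (1, 1.0), (1, 0.0), (0, 1.0), (1, 1.0)]
print(first_winner_pulls(pulls), unique_levers_tried(pulls))
# Pseudo-reward for trying a second, new lever versus repeating the first one.
new_lever = shaping_term(unique_levers_tried, pulls[:1], pulls[:2], gamma=0.9)
repeat = shaping_term(unique_levers_tried, pulls[:1], [(0, 0.0), (0, 0.0)], gamma=0.9)
print(new_lever, repeat)
```

With $\gamma=0.9$ as in Figure 4, trying a new lever yields a positive shaping term and repeating a known lever yields a small negative one, which is the intended nudge toward untried levers.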

Theorems & Definitions (26)

Lemma 3.1
Definition 3.1: Value of Information
Definition 3.2: Value of Opportunity
Lemma 3.2: BAMDP Value Decomposition
Definition 4.1
Theorem 4.2: BAMDP Potential-Based Shaping Theorem
Theorem 4.3
Theorem A.1: BAMDP Potential-Based Shaping Theorem
proof
proof
...and 16 more
