BAMDP Shaping: a Unified Framework for Intrinsic Motivation and Reward Shaping

Aly Lidayan, Michael Dennis, Stuart Russell

TL;DR

By formulating RL as a BAMDP, the paper unifies intrinsic motivation and reward shaping under a principled framework and identifies the Value of Information ($VOI$) and Value of Opportunity ($VOO$) as core components of BAMDP value. It then proposes BAMDP Potential-Based Shaping Functions (BAMPFs) that take the form $F(h_t)=\gamma\phi(h_t)-\phi(h_{t-1})$ to guide exploration without altering the underlying optimal policy, and proves a BAMDP PBS Theorem establishing Bayes-optimality preservation in both meta-RL and RL when using BAMPFs. Theoretical results are complemented by experiments on Bernoulli Bandits and Mountain Car, plus a Curiosity case study showing how many pseudo-rewards can be retrofitted as BAMPFs to resist reward hacking. The framework provides practical, retrofittable guidelines for designing intrinsic motivation and shaping terms that improve exploration while avoiding degenerate behaviors across diverse RL settings.
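The shaping form $F(h_t)=\gamma\phi(h_t)-\phi(h_{t-1})$ telescopes when summed with discounting, which may help explain why such terms cannot accumulate unboundedly along a trajectory. The following is a minimal sketch of that identity only, not the paper's implementation. It assumes the history $h_t$ can be represented as a prefix of a flat sequence of observations, and `bampf_rewards`, `discounted_sum`, and `unique_count` are hypothetical names introduced for illustration.

```python
from typing import Callable, Hashable, List, Sequence


def bampf_rewards(
    history: Sequence[Hashable],
    phi: Callable[[Sequence[Hashable]], float],
    gamma: float,
) -> List[float]:
    """Return F(h_t) = gamma * phi(h_t) - phi(h_{t-1}) for t = 1..T.

    Assumption: h_t is modelled as the prefix history[: t + 1].
    """
    return [
        gamma * phi(history[: t + 1]) - phi(history[:t])
        for t in range(1, len(history))
    ]


def discounted_sum(rewards: Sequence[float], gamma: float) -> float:
    """Sum rewards with discount gamma**k applied to the k-th reward."""
    return sum(gamma**k * r for k, r in enumerate(rewards))


def unique_count(h: Sequence[Hashable]) -> float:
    """Illustrative potential: number of distinct entries seen so far."""
    return float(len(set(h)))


gamma = 0.99
history = ["s0", "s1", "s0", "s2", "s2"]  # toy history, T = 4 transitions

total = discounted_sum(bampf_rewards(history, unique_count, gamma), gamma)

# Telescoping: the discounted sum depends only on the first and last potentials.
T = len(history) - 1
expected = gamma**T * unique_count(history) - unique_count(history[:1])
```

In this toy run `total` and `expected` agree up to floating-point error, so the shaping terms contribute only $\gamma^T\phi(h_T)-\phi(h_0)$ to the discounted return. The conditions under which this preserves Bayes-optimality are those of the paper's theorems and are not checked by the sketch.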

Abstract

Intrinsic motivation and reward shaping guide reinforcement learning (RL) agents by adding pseudo-rewards, which can lead to useful emergent behaviors. However, they can also encourage counterproductive exploits, e.g., fixation on noisy TV screens. Here we provide a theoretical model which anticipates these behaviors, and provides broad criteria under which adverse effects can be bounded. We characterize all pseudo-rewards as reward shaping in Bayes-Adaptive Markov Decision Processes (BAMDPs), which formulate the problem of learning in MDPs as an MDP over the agent's knowledge. Optimal exploration maximizes BAMDP state value, which we decompose into the value of the information gathered and the prior value of the physical state. Pseudo-rewards guide RL agents by rewarding behavior that increases these value components, while they hinder exploration when they align poorly with the actual value. We extend potential-based shaping theory to prove that BAMDP Potential-Based Shaping Functions (BAMPFs) are immune to reward-hacking (convergence to behaviors maximizing composite rewards to the detriment of real rewards) in meta-RL, and show empirically how a BAMPF helps a meta-RL agent learn optimal RL algorithms for a Bernoulli Bandit domain. We finally prove that BAMPFs with bounded monotone increasing potentials also resist reward-hacking in the regular RL setting. We show that it is straightforward to retrofit or design new pseudo-reward terms in this form, and provide an empirical demonstration in the Mountain Car environment.

Paper Structure

This paper contains 45 sections, 12 theorems, 54 equations, 9 figures, 1 table.

Key Result

Lemma 3.1

The Bayesian regret (Ghavamzadeh et al., 2015) of algorithms acting on the estimate $\hat{\bar{Q}}(\bar{s}_t,a)$ can be expressed as:

Figures (9)

  • Figure 1: Example potential-based reward shaping functions $\gamma\phi'-\phi$ for potentials over the MDP state (PBSF) and over the BAMDP state (BAMPF). Start and goal MDP states $s_0,g$ are marked white and green; other states are colored by the pseudo-reward for transitioning there directly from $s_0$ at the start of the episode. Each episode ends and $s$ is reset to $s_0$ every 50 steps, $\gamma=0.99$.
  • Figure 2: The caterpillar domain formulated as a BAMDP. Left: prior $p(M)$ is a categorical distribution over MDPs $M_1$ and $M_2$; in both, all transitions are deterministic, with the curved and straight arrows corresponding to eat and go actions respectively. The caterpillar hatches at state $s_w$, and must decide whether to eat for guaranteed reward $21$, or incur $-5$ reward to go to $s_b$. Right: truncated BAMDP transition diagram, arrows are labeled with rewards (and transition probabilities if $p<1$). The stochastic transitions (from the highlighted eat action) are due to the uncertainty over the MDP; all future transitions (the highlighted arrows) become deterministic once its identity is revealed.
  • Figure 3: The effect of reward shaping on A2C meta-learning an RNN-based RL agent for Bernoulli Bandits with two arms, reward probabilities $(0.1,0.9)$ and a budget of 10 pulls. The mean and standard error of 10 seeds are plotted for each condition. Without shaping, the meta-learner gradually learns to generate RL agents that try fewer arms on average, avoiding over-exploration (grey curve). The 1st Winner Pulls BAMPF (green) sets $\phi$ to the pull count of the first arm that yielded a reward, helping A2C learn to exploit and achieve lower regret more quickly, while still converging to the optimal strategy. However, when this pull count is used directly as a pseudo-reward (1st Winner Pulls, purple), it causes the meta-learner to converge on an agent that over-exploits.
  • Figure 4: The effect of a bounded monotone BAMPF and entropy bonus pseudo-rewards on DQN in a 1-state MDP, where 1 in 100 total levers gives reward 10 when pulled. None refers to DQN without any pseudo-rewards. The BAMPF potential is the count of unique levers tried, and the Entropy reward is 10x the entropy of the last 10 lever pulls. The setting is non-episodic with $\gamma=0.9$. The mean and standard error of 32 seeds are plotted for each condition. See the appendix for full details.
  • Figure 5: The effect of pseudo-rewards on PPO in Mountain Car; the mean and standard error of 10 seeds are plotted for each type of reward shaping. Displacement rewards the current $x$ displacement of the car, Displacement PBS is potential-based shaping with the current displacement as the MDP state potential $\phi(s)$, and Max Displacement BAMPF uses the exponentially smoothed maximum displacement over training as the BAMDP state potential $\phi(h)$. With Displacement the agent learns a reward-hacking policy that avoids the goal to collect more pseudo-rewards, while the BAMPF helps PPO learn to reach the goal more quickly while preserving optimality.
  • ...and 4 more figures
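
The Figure 4 caption describes a bounded monotone potential: the count of unique levers tried, with $\gamma=0.9$. A minimal sketch of how such a pseudo-reward could be computed incrementally is given below. It assumes the shaping form $\gamma\phi(h_t)-\phi(h_{t-1})$ from the TL;DR; the class name `UniqueLeverPotential` and its method are hypothetical, and how the term is combined with DQN's reward in the paper is not shown here.

```python
class UniqueLeverPotential:
    """Tracks phi(h) = number of distinct levers pulled so far.

    The potential never decreases and is bounded by the number of levers
    (100 in the Figure 4 setting), matching the 'bounded monotone' description.
    """

    def __init__(self, gamma: float) -> None:
        self.gamma = gamma
        self.tried = set()  # levers pulled at least once

    def shaping_reward(self, lever: int) -> float:
        """Return gamma * phi(h_t) - phi(h_{t-1}) for pulling `lever`."""
        phi_prev = len(self.tried)
        self.tried.add(lever)
        phi_next = len(self.tried)
        return self.gamma * phi_next - phi_prev


shaper = UniqueLeverPotential(gamma=0.9)
r_new = shaper.shaping_reward(3)      # first pull of lever 3: 0.9 * 1 - 0
r_repeat = shaper.shaping_reward(3)   # repeated pull:         0.9 * 1 - 1
r_other = shaper.shaping_reward(7)    # first pull of lever 7: 0.9 * 2 - 1
```

In this sketch a first pull of an untried lever earns a positive term while repeating a lever earns a small negative one, so the pseudo-reward favors trying new levers until every lever has been tried.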

Theorems & Definitions (26)

  • Lemma 3.1
  • Definition 3.1: Value of Information
  • Definition 3.2: Value of Opportunity
  • Lemma 3.2: BAMDP Value Decomposition
  • Definition 4.1
  • Theorem 4.2: BAMDP Potential-Based Shaping Theorem
  • Theorem 4.3
  • Theorem A.1: BAMDP Potential-Based Shaping Theorem
  • proof
  • proof
  • ...and 16 more