RUDDER: Return Decomposition for Delayed Rewards

Jose A. Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, Johannes Brandstetter, Sepp Hochreiter

TL;DR

RUDDER aims to make the expected future rewards zero, which simplifies Q-value estimation to computing the mean of the immediate reward. It does so through reward redistribution and through return decomposition via contribution analysis, which transforms the reinforcement learning task into a regression task at which deep learning excels.

Abstract

We propose RUDDER, a novel reinforcement learning approach for delayed rewards in finite Markov decision processes (MDPs). In MDPs the Q-values are equal to the expected immediate reward plus the expected future rewards. The latter are related to bias problems in temporal difference (TD) learning and to high variance problems in Monte Carlo (MC) learning. Both problems are even more severe when rewards are delayed. RUDDER aims at making the expected future rewards zero, which simplifies Q-value estimation to computing the mean of the immediate reward. We propose the following two new concepts to push the expected future rewards toward zero. (i) Reward redistribution that leads to return-equivalent decision processes with the same optimal policies and, when optimal, zero expected future rewards. (ii) Return decomposition via contribution analysis, which transforms the reinforcement learning task into a regression task at which deep learning excels. On artificial tasks with delayed rewards, RUDDER is significantly faster than MC and exponentially faster than Monte Carlo Tree Search (MCTS), TD(λ), and reward shaping approaches. On Atari games, RUDDER on top of a Proximal Policy Optimization (PPO) baseline improves the scores, most prominently on games with delayed rewards. Source code is available at https://github.com/ml-jku/rudder and demonstration videos at https://goo.gl/EQerZV.
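The idea of redistributing a delayed return can be illustrated with a minimal sketch. It assumes a hypothetical model that, after each step of an episode, predicts the final return; the redistributed reward at each step is the difference between consecutive predictions. The function name, its arguments, and the choice to add the remaining prediction error to the last step (so that the episode return is preserved exactly) are assumptions made for this illustration, not the authors' implementation.

```python
from typing import List, Sequence


def redistribute_return(predicted_returns: Sequence[float],
                        episode_return: float) -> List[float]:
    """Spread a delayed episode return over the steps of the episode.

    predicted_returns[t] is a (hypothetical) model's prediction of the final
    return after observing the episode up to and including step t.
    """
    if not predicted_returns:
        raise ValueError("need at least one prediction")

    rewards = []
    previous = 0.0  # assumed prediction before anything has been observed
    for prediction in predicted_returns:
        # A step is rewarded by how much it changes the predicted return.
        rewards.append(prediction - previous)
        previous = prediction

    # Assumption: put the residual prediction error on the last step so the
    # redistributed rewards sum to the observed episode return.
    rewards[-1] += episode_return - previous
    return rewards


if __name__ == "__main__":
    # The prediction jumps at step 2, so most of the reward moves there.
    print(redistribute_return([0.0, 0.0, 0.9, 0.9, 1.0], episode_return=1.0))
```

Because the differences telescope, the redistributed rewards always sum to the original return, so the return of the episode is unchanged while the reward moves to the steps that changed the prediction.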

Paper Structure

This paper contains 138 sections, 33 theorems, 328 equations, 16 figures, 7 tables.

Key Result

Theorem 1

Both the SDP $\tilde{\mathcal{P}}$ with delayed reward $\tilde{R}_{t+1}$ and the SDP $\mathcal{P}$ with redistributed reward $R_{t+1}$ have the same optimal policies.
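The reasoning behind Theorem 1 can be sketched through return equivalence. The notation below ($\tilde{G}_0$, $G_0$, $\tilde{v}^\pi_0$, $v^\pi_0$, and a finite horizon $T$ without discounting) is assumed here for illustration and does not appear in this summary:

```latex
% Returns of the two processes over an episode of length T (assumed notation)
\tilde{G}_0 = \sum_{t=0}^{T} \tilde{R}_{t+1},
\qquad
G_0 = \sum_{t=0}^{T} R_{t+1}

% Return equivalence: same expected return at t = 0 for every policy \pi
\tilde{v}^{\pi}_0
  = \mathbb{E}_{\pi}\!\left[ \tilde{G}_0 \right]
  = \mathbb{E}_{\pi}\!\left[ G_0 \right]
  = v^{\pi}_0
\quad \text{for all } \pi

% Consequence: the maximizing policies coincide
\arg\max_{\pi} \tilde{v}^{\pi}_0 = \arg\max_{\pi} v^{\pi}_0
```

If every policy has the same expected return in both processes, the policies that maximize it are the same in both.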

Figures (16)

  • Figure 1: Comparison of RUDDER and other methods on artificial tasks with respect to the learning time in episodes (median of 100 trials) vs. the delay of the reward. The shadow bands indicate the $40\%$ and $60\%$ quantiles. In (II), the y-axis of the inlet is scaled by $10^{5}$. In (III), reward shaping (RS), look-ahead advice (look-ahead), and look-back advice (look-back) use three different potential functions. In (III), the dashed blue line represents RUDDER with $Q$($\lambda$), in contrast to RUDDER with $Q$-estimation. In all tasks, RUDDER significantly outperforms all other methods.
  • Figure 2: RUDDER redistributes rewards to key events in the Atari game Bowling. Originally, rewards are delayed and only given at episode end. The first 120 out of 200 frames of the episode are shown. RUDDER identifies key actions that steer the ball to hit all pins.
  • Figure A1: Examples of how affected states (cyan) affect states in a previous time step (indicated by cyan edges) starting with $n_5=1$ (one affected state). The left panel shows no overlap since affected states in $s_{t-1}$ connect only to one affected state in $s_t$. The right panel shows some overlap since affected states in $s_{t-1}$ connect to multiple affected states in $s_t$.
  • Figure A2: The function $\left( 1 - \left( 1 - \frac{c_t}{N_{t-1}} \right)^{n_t} \right)$, which scales $N_{t-1}$ in Theorem \ref{th:Aaffect}. This function determines the growth of $a_k$, which is exponential at the beginning and then linear as the function approaches 1.
  • Figure A3: (a) Experimental evaluation of bias and variance of different $Q$-value estimators on the Grid World. (b) Normalized bias reduction for different delays. Right: Average variance reduction for the 10th highest values.
  • ...and 11 more figures

Theorems & Definitions (74)

  • Theorem 1
  • Definition 1
  • Theorem 2
  • Theorem 3
  • Definition A1: Ravindran and Barto [Ravindran:01, Ravindran:03]
  • Lemma A1: Ravindran and Barto [Ravindran:01]
  • Definition A2
  • Proposition A1
  • proof
  • Definition A3
  • ...and 64 more