Skill or Luck? Return Decomposition via Advantage Functions

Hsiao-Ru Pan; Bernhard Schölkopf

TL;DR

This work builds on the insight that the advantage function can be understood as the causal effect of an action on the return, and shows that this insight makes it possible to decompose the return of a trajectory into parts caused by the agent's actions and parts outside the agent's control.

Abstract

Learning from off-policy data is essential for sample-efficient reinforcement learning. In the present work, we build on the insight that the advantage function can be understood as the causal effect of an action on the return, and show that this allows us to decompose the return of a trajectory into parts caused by the agent's actions (skill) and parts outside of the agent's control (luck). Furthermore, this decomposition enables us to naturally extend Direct Advantage Estimation (DAE) to off-policy settings (Off-policy DAE). The resulting method can learn from off-policy trajectories without relying on importance sampling techniques or truncating off-policy actions. We draw connections between Off-policy DAE and previous methods to demonstrate how it can speed up learning and when the proposed off-policy corrections are important. Finally, we use the MinAtar environments to illustrate how ignoring off-policy corrections can lead to suboptimal policy optimization performance.
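
As a rough illustration of the skill/luck decomposition described in the abstract, the sketch below uses the seven-state MDP from the caption of Figure 4 with the uniform policy. It assumes a discount factor of 1, a reward of 0 in state 4 (chosen so that $V(1)=V(2)=1$, as the caption states), and the definition $B^\pi(s,a,s') = V^\pi(s') - \mathbb{E}[V^\pi(s'')\mid s,a]$ for the luck term, which matches the signature $\hat{B}(s_t,a_t,s_{t+1})$ in Theorem 1 but is not spelled out on this page. The names `R`, `T`, `solve_values`, and `decompose` are hypothetical, and exact values are computed by backward induction rather than learned.

```python
from fractions import Fraction

HALF = Fraction(1, 2)
TERMINAL = 7

# Hypothetical encoding of the Figure 4 MDP. R[s][a] is the immediate reward,
# T[s][a] maps next states to probabilities. Reward 0 in state 4 is assumed.
R = {1: {"go": 0}, 2: {"go": 0}, 3: {"u": 1, "d": 0},
     4: {"go": 0}, 5: {"go": 1}, 6: {"go": 0}}
T = {1: {"go": {3: 1}}, 2: {"go": {3: 1}},
     3: {"u": {4: 1}, "d": {4: 1}},
     4: {"go": {5: HALF, 6: HALF}},
     5: {"go": {7: 1}}, 6: {"go": {7: 1}}}


def expected_next_value(s, a, V):
    """E[V(s') | s, a]: the part of the next value that nature averages over."""
    return sum(p * V[s2] for s2, p in T[s][a].items())


def solve_values():
    """Exact V under the uniform policy (gamma = 1) by backward induction."""
    V = {TERMINAL: Fraction(0)}
    for s in sorted(T, reverse=True):  # successors always have larger indices
        actions = list(T[s])
        V[s] = sum(R[s][a] + expected_next_value(s, a, V) for a in actions) / len(actions)
    return V


def decompose(trajectory, V):
    """Split the return of [(s, a, s_next), ...] into (skill, luck)."""
    skill = luck = Fraction(0)
    for s, a, s2 in trajectory:
        ev = expected_next_value(s, a, V)
        skill += R[s][a] + ev - V[s]  # advantage A(s, a) = Q(s, a) - V(s)
        luck += V[s2] - ev            # B(s, a, s') = V(s') - E[V(s') | s, a]
    return skill, luck


V = solve_values()
# The agent picks the better action u in state 3, then nature sends it to state 6.
trajectory = [(1, "go", 3), (3, "u", 4), (4, "go", 6), (6, "go", 7)]
skill, luck = decompose(trajectory, V)
total_return = sum(R[s][a] for s, a, _ in trajectory)
print(V[1], skill, luck, total_return)
```

On this trajectory the return equals $V(1)$ plus skill plus luck: choosing $\mathtt{u}$ contributes $+1/2$ and the unlucky transition to state 6 contributes $-1/2$.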

Paper Structure (32 sections, 2 theorems, 25 equations, 13 figures, 2 tables, 2 algorithms)

Introduction
Background
Direct Advantage Estimation
Multi-step learning
Return Decomposition
The Deterministic Case
The Stochastic Case
Approximating the constraint
Relationship to other methods
Monte-Carlo Methods
The Uncorrected Method
Experiments
Environment
Agent Design
Results
...and 17 more sections

Key Result

Theorem 1

Given a behavior policy $\mu$, a target policy $\pi$, and a backup length $n\geq 0$, let $\hat{A}_{t}=\hat{A}(s_{t}, a_{t})$, $\hat{B}_{t}=\hat{B}(s_{t}, a_{t}, s_{t+1})$, and the objective function [...]. Then $(A^\pi, B^\pi, V^\pi)$ is a minimizer of the above problem. Furthermore, the minimizer is unique if $\mu$ is sufficiently explorative (i.e., non-zero probability of reaching all possible transition

Figures (13)

Figure 1: A two-step view of the state transition process. First, we introduce an imaginary agent nature, which controls the stochastic part of the transition process. In this view, nature lives in a world with state space $\Bar{\mathcal{S}}=\mathcal{S}\times\mathcal{A}$ and action space $\Bar{\mathcal{A}}=\mathcal{S}$. At each time step $t$, the agent chooses its action $a_t$ based on $s_t$, and, instead of transitioning directly into the next state, it transitions into an intermediate state denoted $(s_t,a_t)\in\Bar{\mathcal{S}}$, where nature chooses the next state $s_{t+1}\in\Bar{\mathcal{A}}$ based on $(s_t, a_t)$. We use nodes and arrows to represent states and actions by the agent (red) and nature (blue).
Figure 2: Latent variable model of transitions; $\mathcal{Z}$ is a discrete latent space, which can be understood as actions from nature.
Figure 3: Left: An MDP with $\mathcal{S}=\{1,2,3,4\}$. Both states 1 and 2 have only a single action with immediate rewards 0 that leads to state 3. State 3 has two actions, $\mathtt{u}$ and $\mathtt{d}$, that lead to the terminal state 4 with immediate rewards 1 and 0, respectively. Right: We compare the values estimated by Batch TD(0), MC, and DAE with trajectories sampled from the uniform policy. Lines and shadings represent the average and one standard deviation of the estimated values over 1000 random seeds. The dashed line represents the true value $V(1)=V(2)=0.5$. See Appendix \ref{app:classic} for details.
Figure 4: Left: An MDP extended from Figure 3. Instead of terminating at state 4, the agent transitions randomly to state 5 or 6 with equal probabilities. Both states 5 and 6 have a single action, with rewards 1 and 0, respectively. State 7 is the terminal state. Right: We compare the values (with uniform policy) estimated by DAE, Off-policy DAE (learned transition probabilities), and Off-policy DAE (oracle, known transition probabilities). Lines and shadings represent the average and one standard deviation of the estimated values over 1000 random seeds. The dashed line represents the true value $V(1)=V(2)=1$.
Figure 5: Normalized training curves aggregated over deterministic (left) and stochastic (right) environments. Scores were first normalized using the PPO-DAE baseline and then aggregated over 20 random seeds, environments, and backup lengths. Lines and shadings represent the means and 1 standard error of the means, respectively. The dotted horizontal line shows the PPO-DAE baseline.
...and 8 more figures
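
To make the Figure 3 comparison more concrete, the sketch below contrasts two of the estimators named in that caption, Monte-Carlo (MC) averaging and Batch TD(0), on the four-state MDP with an undiscounted return. It is only an illustration under assumptions: the function names and the tiny hand-written batch are hypothetical, Batch TD(0) is approximated by repeatedly replacing each value with its averaged one-step target (the fixed point that batch TD(0) is known to converge to), and DAE is not implemented here.

```python
from collections import defaultdict


def batch_mc(episodes):
    """Average the observed undiscounted return from every visited state."""
    returns = defaultdict(list)
    for episode in episodes:  # episode = [(state, reward, next_state), ...]
        rewards = [r for _, r, _ in episode]
        for i, (s, _, _) in enumerate(episode):
            returns[s].append(sum(rewards[i:]))
    return {s: sum(g) / len(g) for s, g in returns.items()}


def batch_td0(episodes, terminal=4, sweeps=100):
    """Sweep the batch, setting each V(s) to its mean one-step TD target."""
    transitions = defaultdict(list)
    for episode in episodes:
        for s, r, s2 in episode:
            transitions[s].append((r, s2))
    V = defaultdict(float)
    for _ in range(sweeps):
        for s, outcomes in transitions.items():
            targets = [r + (0.0 if s2 == terminal else V[s2]) for r, s2 in outcomes]
            V[s] = sum(targets) / len(targets)
    return dict(V)


# Hypothetical batch: starts from state 1 happened to pick u (reward 1),
# starts from state 2 happened to pick d (reward 0).
episodes = [
    [(1, 0, 3), (3, 1, 4)],
    [(1, 0, 3), (3, 1, 4)],
    [(2, 0, 3), (3, 0, 4)],
    [(2, 0, 3), (3, 0, 4)],
]
mc = batch_mc(episodes)
td = batch_td0(episodes)
print("MC:", mc)
print("TD:", td)
```

On this batch MC assigns different values to states 1 and 2 because it attributes the later action's outcome to the starting state, whereas Batch TD(0) shares the estimate of state 3 and returns 0.5 for both, which matches the true value given in the caption.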

Theorems & Definitions (3)

Theorem 1: Off-policy DAE
Theorem : Off-policy DAE
proof
