SOAP-RL: Sequential Option Advantage Propagation for Reinforcement Learning in POMDP Environments

Shu Ishida, João F. Henriques

TL;DR

Evaluated against competing baselines, SOAP exhibited the most robust performance, correctly discovering options in POMDP corridor environments and on standard benchmarks including Atari and MuJoCo, and outperforming the PPOEM, LSTM, and Option-Critic baselines.

Abstract

This work compares ways of extending Reinforcement Learning algorithms to Partially Observed Markov Decision Processes (POMDPs) with options. One view of options is as temporally extended actions, which can be realized as a memory that allows the agent to retain historical information beyond the policy's context window. While option assignment could be handled using heuristics and hand-crafted objectives, learning temporally consistent options and associated sub-policies without explicit supervision is a challenge. Two algorithms, PPOEM and SOAP, are proposed and studied in depth to address this problem. PPOEM applies the forward-backward algorithm (for Hidden Markov Models) to optimize the expected returns for an option-augmented policy. However, this learning approach is unstable during on-policy rollouts. It is also ill-suited to learning causal policies that must act without knowledge of future trajectories, since option assignments are optimized for offline sequences where the entire episode is available. As an alternative approach, SOAP evaluates the policy gradient for an optimal option assignment. It extends the concept of generalized advantage estimation (GAE) to propagate option advantages through time, which is analytically equivalent to performing temporal back-propagation of option policy gradients. The resulting option policy is conditioned only on the history of the agent, not on future actions. Evaluated against competing baselines, SOAP exhibited the most robust performance, correctly discovering options in POMDP corridor environments and on standard benchmarks including Atari and MuJoCo, and outperforming the PPOEM, LSTM, and Option-Critic baselines. The open-sourced code is available at https://github.com/shuishida/SoapRL.
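
For readers unfamiliar with generalized advantage estimation, which the abstract says SOAP extends, the following is a minimal sketch of the standard GAE recursion only. It does not implement SOAP's option-advantage propagation. The function name `gae_advantages`, its parameters, and the assumption of a single rollout with no termination inside it are illustrative choices, not taken from the paper or its code.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Standard GAE over one rollout (illustrative sketch, not the SOAP method).

    rewards[t]  : reward received after acting at step t
    values[t]   : critic estimate V(s_t)
    last_value  : critic estimate for the state after the final step
                  (use 0.0 if the rollout ended in a terminal state)
    """
    # V(s_{t+1}) for every step of the rollout.
    next_values = list(values[1:]) + [last_value]
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so each step reuses the advantage of the step after it:
    #   A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_values[t] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    # A sparse reward at the end of a three-step rollout, with a zero critic.
    print(gae_advantages([0.0, 0.0, 1.0], [0.0, 0.0, 0.0], 0.0, gamma=0.5, lam=1.0))
```

With `lam=0` the estimate reduces to the one-step TD residual, and with `lam=1` it becomes the discounted return minus the value baseline; intermediate values trade bias against variance.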

Paper Structure (28 sections, 39 equations, 9 figures, 2 tables)


Figures (9)

  • Figure 1: An HMM for sequential data $\bm{X}$ of length $T$, given latent variables $\bm{Z}$.
  • Figure 2: Probabilistic graphical models showing the relationships between options $z$, actions $a$ and states $s$ at time step $t$. In the standard options framework, $b_t$ denotes a boolean variable that triggers a switch of options when activated. This work adopts a more general formulation than the options framework, as defined in Equation \ref{eq:ppoem/joint_option_policy}.
  • Figure 3: An HMM showing the relationships between options $z$, actions $a$ and states $s$. The dotted arrows indicate that the same pattern repeats where the intermediate time steps are abbreviated.
  • Figure 4: A corridor environment. The example shown has a length $L=20$. The agent, represented as a green circle, starts at the left end of the corridor and moves towards the right. When it reaches the right end, the agent can take either an up action or a down action, which takes it to either a yellow cell or a grey cell. The yellow cell gives a reward of $1$, while the grey cell gives a reward of $-1$. All other cells give a reward of $0$. The locations of the rewarding yellow cell and the penalizing grey cell are determined by the color of the starting cell (either "blue" or "red"), as shown, and this color is randomized, each with $50\%$ probability. The agent only has access to the color of the current cell as observation. For simplicity of implementation, the agent's action space is {"up", "down"}, and apart from the fork at the right end, taking either action at each time step moves the agent one cell to the right. The images shown are taken from rollouts of the SOAP agent after training for $100k$ steps. The agent successfully navigated to the rewarding cell in both cases.
  • Figure 5: Training curves of RL agents showing the episodic rewards obtained in the corridor environment with varying lengths. The mean (solid line) and the min-max range (colored shadow) for $5$ seeds per algorithm are shown.
  • ...and 4 more figures
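
The corridor environment described in the Figure 4 caption can be sketched in a few lines of plain Python. This is a hedged illustration, not the authors' implementation: the class name `CorridorEnv`, the `REWARDING_ACTION` mapping from start color to rewarding action, the `"white"` observation for intermediate cells, and the exact step count before the fork are all assumptions, since the caption does not specify them.

```python
import random

# Assumption: which start color maps to which rewarding action is not stated
# in the caption, so this mapping is hypothetical.
REWARDING_ACTION = {"blue": "up", "red": "down"}


class CorridorEnv:
    """Toy corridor: only the color of the current cell is observed."""

    def __init__(self, length=20, seed=None):
        self.length = length
        self.rng = random.Random(seed)
        self.start_color = None
        self.pos = 0

    def reset(self):
        # The start color is drawn with 50% probability each and is the only
        # cue that tells the agent where the reward will be.
        self.start_color = self.rng.choice(["blue", "red"])
        self.pos = 0
        return self.start_color

    def step(self, action):
        assert action in ("up", "down")
        if self.pos < self.length - 1:
            # Before the fork, either action moves the agent one cell right.
            self.pos += 1
            return "white", 0.0, False  # "white" is an assumed neutral color
        # At the right end, the action chooses between the two end cells.
        if action == REWARDING_ACTION[self.start_color]:
            return "yellow", 1.0, True
        return "grey", -1.0, True


def run_episode(env, policy):
    """Roll out one episode; policy maps the observation history to an action."""
    history = [env.reset()]
    total, done = 0.0, False
    while not done:
        obs, reward, done = env.step(policy(history))
        history.append(obs)
        total += reward
    return total


if __name__ == "__main__":
    env = CorridorEnv(length=20, seed=0)
    # A policy that remembers the first observation always reaches the reward.
    print(run_episode(env, lambda history: REWARDING_ACTION[history[0]]))
```

The point of the design is that a policy seeing only the current cell cannot do better than chance at the fork, whereas one that carries the start color forward (for example through an option or recurrent state) can always pick the rewarding branch.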