Mitigating Partial Observability in Sequential Decision Processes via the Lambda Discrepancy

Cameron Allen; Aaron Kirtland; Ruo Yu Tao; Sam Lobel; Daniel Scott; Nicholas Petrocelli; Omer Gottesman; Ronald Parr; Michael L. Littman; George Konidaris

TL;DR

We prove that the $\lambda$-discrepancy is exactly zero for all Markov decision processes and almost always non-zero for a broad class of partially observable environments. Once partial observability has been detected, minimizing the $\lambda$-discrepancy can help an agent learn a memory function that mitigates it.

Abstract

Reinforcement learning algorithms typically rely on the assumption that the environment dynamics and value function can be expressed in terms of a Markovian state representation. However, when state information is only partially observable, how can an agent learn such a state representation, and how can it detect when it has found one? We introduce a metric that can accomplish both objectives, without requiring access to -- or knowledge of -- an underlying, unobservable state space. Our metric, the $λ$-discrepancy, is the difference between two distinct temporal difference (TD) value estimates, each computed using TD($λ$) with a different value of $λ$. Since TD($λ{=}0$) makes an implicit Markov assumption and TD($λ{=}1$) does not, a discrepancy between these estimates is a potential indicator of a non-Markovian state representation. Indeed, we prove that the $λ$-discrepancy is exactly zero for all Markov decision processes and almost always non-zero for a broad class of partially observable environments. We also demonstrate empirically that, once detected, minimizing the $λ$-discrepancy can help with learning a memory function to mitigate the corresponding partial observability. We then train a reinforcement learning agent that simultaneously constructs two recurrent value networks with different $λ$ parameters and minimizes the difference between them as an auxiliary loss. The approach scales to challenging partially observable domains, where the resulting agent frequently performs significantly better (and never performs worse) than a baseline recurrent agent with only a single value network.
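To make the core idea concrete, the following is a minimal tabular sketch, not the paper's implementation. It assumes a hypothetical two-step environment in which the first observation (`X` or `Y`) determines the final reward, and it compares batch TD(0) values with every-visit Monte Carlo values (the TD(1) estimate) over observations. The function names, the toy episodes, and the per-observation absolute gap are illustrative assumptions; the paper's own definition of the $λ$-discrepancy may weight and aggregate the difference differently.

```python
from collections import defaultdict


def mc_values(episodes, gamma=1.0):
    """Every-visit Monte Carlo (TD(lambda=1)) value estimate per observation."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk backwards so g accumulates the discounted return-to-go.
        for obs, reward in reversed(episode):
            g = reward + gamma * g
            returns[obs].append(g)
    return {obs: sum(gs) / len(gs) for obs, gs in returns.items()}


def td0_values(episodes, gamma=1.0, sweeps=100):
    """Batch TD(lambda=0) fixed point: bootstrap from the next observation's value."""
    transitions = defaultdict(list)  # obs -> [(reward, next_obs or None)]
    for episode in episodes:
        for t, (obs, reward) in enumerate(episode):
            next_obs = episode[t + 1][0] if t + 1 < len(episode) else None
            transitions[obs].append((reward, next_obs))
    values = {obs: 0.0 for obs in transitions}
    for _ in range(sweeps):
        # Synchronous sweep: every update reads the previous sweep's values.
        values = {
            obs: sum(r + gamma * (values[n] if n is not None else 0.0)
                     for r, n in outcomes) / len(outcomes)
            for obs, outcomes in transitions.items()
        }
    return values


def lambda_discrepancy(episodes, gamma=1.0):
    """Per-observation gap between the TD(1) and TD(0) estimates (illustrative)."""
    v_mc = mc_values(episodes, gamma)
    v_td = td0_values(episodes, gamma)
    return {obs: abs(v_mc[obs] - v_td[obs]) for obs in v_mc}


# Hypothetical data: each episode is a list of (observation, reward) pairs.
# Aliased case: both second-step states emit the same observation "Z".
aliased = [[("X", 0.0), ("Z", 1.0)], [("Y", 0.0), ("Z", 0.0)]]
# Markov case: the second-step states emit distinct observations.
markov = [[("X", 0.0), ("Z1", 1.0)], [("Y", 0.0), ("Z2", 0.0)]]

print(lambda_discrepancy(aliased))
print(lambda_discrepancy(markov))
```

In the aliased case, TD(0) bootstraps through the shared observation `Z` and assigns `X` and `Y` the same value, while Monte Carlo returns still distinguish them, so the two estimates disagree. With distinct second-step observations the estimates coincide, which is the behaviour the abstract describes for Markovian representations.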

Paper Structure (41 sections, 2 theorems, 48 equations, 15 figures, 3 tables, 3 algorithms)

Introduction
Background
Detecting Partial Observability
Value Function Estimation under Partial Observability
Lambda-Discrepancy
What conditions cause the Lambda-discrepancy to be zero?
Parity Check Environment.
Memory Learning with the Lambda-Discrepancy
A Scalable, Online Learning Objective
Combining the Lambda-Discrepancy with PPO
Large Partially Observable Environments
Experiments
Related Work
Conclusion
Limitations
...and 26 more sections

Key Result

Theorem 1

Given a POMDP model $\mathcal{P}$ and distinct $\lambda \ne \lambda'$, if there exists a policy $\pi: \Omega \to \Delta \mathcal{A}$ such that $\Lambda_{\mathcal{P},\pi}^{\lambda,\lambda'} \neq 0$, then $\Lambda_{\mathcal{P},\pi}^{\lambda,\lambda'}\ne 0$ for all policies except at most a set of meas

Figures (15)

Figure 1: T-maze decision process. The agent must remember the initial observation to earn the maximum reward (+1).
Table 1: Hyperparameters swept across all algorithms. Rows labelled with $\lambda$-discrepancy are hyperparameters swept specifically for our algorithm.
Table 2: Best hyperparameters for each environment and each algorithm. Hyperparameters were selected using 5 seeds, taking the maximum AUC.
Figure 3: T-Maze $\lambda$-discrepancy, mixing between full and partial observability. (Left) MDP observation function $\Phi_\mathrm{Perfect}$. (Right) Various POMDP observation functions $\Phi_\mathrm{Aliased}$ that produce aliased observations at the corridor states, junction states, or both. State indices correspond to starting states (0, 1), hallway (2--11), junctions (12, 13), and terminal state (14). Brighter squares indicate higher probability. (Center) $\lambda$-discrepancy has a minimum at zero for full observability and increases with partial observability. We interpolate between perfect observations and aliased ones, where the observation function is $\Phi = (1-p) \cdot \Phi_\mathrm{Perfect} + p \cdot \Phi_\mathrm{Aliased}$.
Table 3: Environment-specific hyperparameters, set across all algorithms. We set the entropy coefficient to a higher value in RockSample because the environment requires more exploration.
...and 10 more figures

Theorems & Definitions (3)

Definition 1
Theorem 1
Theorem 2
