Learning Markov State Abstractions for Deep Reinforcement Learning

Cameron Allen, Neev Parikh, Omer Gottesman, George Konidaris

TL;DR

This work introduces a novel set of conditions, proves that they are sufficient for learning a Markov abstract state representation, and describes a practical training procedure that combines inverse model estimation and temporal contrastive learning to learn an abstraction that approximately satisfies them.

Abstract

A fundamental assumption of reinforcement learning in Markov decision processes (MDPs) is that the relevant decision process is, in fact, Markov. However, when MDPs have rich observations, agents typically learn by way of an abstract state representation, and such representations are not guaranteed to preserve the Markov property. We introduce a novel set of conditions and prove that they are sufficient for learning a Markov abstract state representation. We then describe a practical training procedure that combines inverse model estimation and temporal contrastive learning to learn an abstraction that approximately satisfies these conditions. Our novel training objective is compatible with both online and offline training: it does not require a reward signal, but agents can capitalize on reward information when available. We empirically evaluate our approach on a visual gridworld domain and a set of continuous control benchmarks. Our approach learns representations that capture the underlying structure of the domain and lead to improved sample efficiency over state-of-the-art deep reinforcement learning with visual features -- often matching or exceeding the performance achieved with hand-designed compact state information.
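
The combination of inverse model estimation and temporal contrastive learning described above can be sketched as a single weighted loss. The following is a minimal illustration, not the authors' implementation: the function names, the weights `alpha` and `beta`, the toy encoder, inverse model, and discriminator, and the choice to build fake transitions by shuffling next states within a batch are all assumptions made for this example.

```python
import math
import random


def softmax_cross_entropy(logits, target):
    """Cross-entropy of one categorical prediction against an integer label."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_norm - logits[target]


def binary_cross_entropy(logit, label):
    """Numerically stable binary cross-entropy on a single logit (label is 0 or 1)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def markov_loss(encoder, inverse_model, discriminator, batch,
                alpha=1.0, beta=1.0, rng=random):
    """Weighted sum of an inverse-model loss and a contrastive loss.

    batch is a list of (x, a, x_next) transitions. alpha and beta are
    hypothetical weights chosen for this sketch.
    """
    z = [encoder(x) for x, _, _ in batch]
    z_next = [encoder(x_next) for _, _, x_next in batch]
    actions = [a for _, a, _ in batch]
    n = len(batch)

    # Inverse model: predict which action led from z to z_next.
    l_inv = sum(softmax_cross_entropy(inverse_model(zi, zn), a)
                for zi, zn, a in zip(z, z_next, actions)) / n

    # Contrastive model: real pairs get label 1; pairs with shuffled next
    # states get label 0. A shuffle may leave some pairs unchanged, which
    # this sketch ignores (assumed negative-sampling scheme).
    z_fake = list(z_next)
    rng.shuffle(z_fake)
    l_real = sum(binary_cross_entropy(discriminator(zi, zn), 1)
                 for zi, zn in zip(z, z_next))
    l_fake = sum(binary_cross_entropy(discriminator(zi, zf), 0)
                 for zi, zf in zip(z, z_fake))
    l_ratio = (l_real + l_fake) / (2 * n)

    return alpha * l_inv + beta * l_ratio


# Toy stand-ins for the learned networks (all hypothetical).
ACTION_DELTAS = [(0, 1), (0, -1), (1, 0), (-1, 0)]


def toy_encoder(x):
    # Identity-like encoder: observation is already a (row, col) pair.
    return [float(x[0]), float(x[1])]


def toy_inverse_model(z, z_next):
    # Logit per action: alignment between the observed move and each action.
    d = (z_next[0] - z[0], z_next[1] - z[1])
    return [d[0] * da + d[1] * db for da, db in ACTION_DELTAS]


def toy_discriminator(z, z_next):
    # Higher logit when the two abstract states are close together.
    return 2.0 - (abs(z_next[0] - z[0]) + abs(z_next[1] - z[1]))


batch = [((0, 0), 0, (0, 1)), ((2, 3), 2, (3, 3)),
         ((5, 5), 1, (5, 4)), ((4, 1), 3, (3, 1))]
loss = markov_loss(toy_encoder, toy_inverse_model, toy_discriminator,
                   batch, rng=random.Random(0))
print(loss)
```

Neither term uses a reward signal, which matches the abstract's statement that the objective does not require one. In a real setup the three toy functions would be trainable networks sharing one encoder, and the loss would be minimized by gradient descent.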

Paper Structure

This paper contains 46 sections, 7 theorems, 29 equations, 10 figures, 3 tables.

Key Result

theorem 1

If $\phi:X\rightarrow Z$ is a state abstraction of MDP $M = (X,A,R,T,\gamma)$ such that for any policy $\pi$ in the policy class $\Pi_\phi$, the following conditions hold for every timestep $t$: Then $\phi$ is a Markov state abstraction.

Figures (10)

  • Figure 1: (Left) A $6\times 6$ visual gridworld domain with hidden state $s$ and unknown sensor $\sigma$, where an abstraction function $\phi$ maps each high-dimensional observed state $x$ to a lower-dimensional abstract state $z$ (orange circle). (Right) Our Markov abstraction training architecture. A shared encoder $\phi$ maps ground states $x,x'$ to abstract states $z,z'$, which are inputs to an inverse dynamics model $I$ and a contrastive model $D$ that discriminates between real and fake state transitions. The agent's policy $\pi$ depends only on the current abstract state.
  • Figure 2: An MDP and a non-Markov abstraction.
  • Figure 3: (a) Visualization of learning progress at selected times (left to right) of a 2-D state abstraction for the $6 \times 6$ visual gridworld domain: (top row) $\mathcal{L}_{Markov}$; (middle row) $\mathcal{L}_{Inv}$ only; (bottom row) $\mathcal{L}_{Ratio}$ only. Color denotes ground-truth $(x,y)$ position, which is not shown to the agent. (b) Mean episode reward for the visual gridworld navigation task. Markov abstractions significantly outperform end-to-end training with visual inputs, and match the performance of the expert $(x,y)$ position features. (300 seeds; 5-point moving average; shaded regions denote 95% confidence intervals.)
  • Figure 4: Mean episode reward vs. environment steps for DeepMind Control. Adding our Markov objective leads to improved learning performance. (10 seeds; 5-point moving average; shaded regions denote 90% confidence intervals; learning curve data is available at the linked code repository.)
  • Figure 5: An MDP and a Markov abstraction that is not a KI abstraction.
  • ...and 5 more figures
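
The captions for Figures 3 and 4 mention a 5-point moving average and shaded confidence intervals across seeds. One way such curves could be produced is sketched below. The trailing window, the normal-approximation interval, and the function names are assumptions for illustration; the captions do not say how the authors computed either quantity.

```python
import math
import statistics


def moving_average(values, window=5):
    """Trailing moving average; early points average over the values seen so far."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def mean_with_ci(samples, z=1.96):
    """Mean and normal-approximation confidence half-width across seeds.

    z=1.96 corresponds to a 95% interval; z=1.645 would give 90%.
    """
    mean = statistics.fmean(samples)
    if len(samples) < 2:
        return mean, 0.0
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, z * sem


# Hypothetical data: curves[seed][t] is the episode reward at evaluation point t.
curves = [[0.0, 1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0, 4.0],
          [2.0, 3.0, 4.0, 5.0]]
per_step = [mean_with_ci([c[t] for c in curves]) for t in range(len(curves[0]))]
smoothed_mean = moving_average([m for m, _ in per_step], window=5)
print(smoothed_mean)
```

With only a handful of seeds, a Student-t interval would be wider than the normal approximation used here.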

Theorems & Definitions (17)

  • definition 1: Markov State Representation
  • definition 2: Markov State Abstraction
  • theorem 1
  • corollary 1
  • lemma D.1
  • Proof 1
  • lemma D.2
  • Proof 2
  • Proof 3: of Theorem 1
  • Proof 4: of Corollary 1
  • ...and 7 more