Multistep Inverse Is Not All You Need

Alexander Levine, Peter Stone, Amy Zhang

TL;DR

A new algorithm, ACDF, is proposed, which combines multistep-inverse prediction with a latent forward model, and is guaranteed to correctly infer an action-dependent latent state encoder for a large class of Ex-BMDP models.

Abstract

In real-world control settings, the observation space is often unnecessarily high-dimensional and subject to time-correlated noise. However, the controllable dynamics of the system are often far simpler than the dynamics of the raw observations. It is therefore desirable to learn an encoder to map the observation space to a simpler space of control-relevant variables. In this work, we consider the Ex-BMDP model, first proposed by Efroni et al. (2022), which formalizes control problems where observations can be factorized into an action-dependent latent state which evolves deterministically, and action-independent time-correlated noise. Lamb et al. (2022) propose the "AC-State" method for learning an encoder to extract a complete action-dependent latent state representation from the observations in such problems. AC-State is a multistep-inverse method, in that it uses the encodings of the first and last states in a path to predict the first action in the path. However, we identify cases where AC-State will fail to learn a correct latent representation of the agent-controllable factor of the state. We therefore propose a new algorithm, ACDF, which combines multistep-inverse prediction with a latent forward model. ACDF is guaranteed to correctly infer an action-dependent latent state encoder for a large class of Ex-BMDP models. We demonstrate the effectiveness of ACDF on tabular Ex-BMDPs through numerical simulations, as well as on high-dimensional environments using neural-network-based encoders. Code is available at https://github.com/midi-lab/acdf.
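
To make the Ex-BMDP factorization in the abstract concrete, the following is a minimal tabular sketch, not the authors' implementation. The sizes, the tables `ENDO_NEXT` and `EXO_P`, and all function names are assumptions chosen for illustration. It shows an endogenous state that evolves deterministically given the action, exogenous noise that evolves as an action-independent Markov chain, and the two kinds of training tuples that the abstract mentions: multistep-inverse triples (first observation, last observation, first action) and latent forward-model triples.

```python
import numpy as np

N_ENDO, N_EXO, N_ACTIONS = 3, 2, 2  # hypothetical sizes

# Deterministic endogenous dynamics: ENDO_NEXT[s, a] is the next latent state.
ENDO_NEXT = np.array([[1, 2],
                      [2, 0],
                      [0, 0]])

# Action-independent, time-correlated noise: a "sticky" Markov chain on e.
EXO_P = np.array([[0.9, 0.1],
                  [0.2, 0.8]])

def observe(s, e):
    """Observation x encodes the pair (s, e); here X is isomorphic to S x E."""
    return s * N_EXO + e

def true_encoder(x):
    """Ground-truth encoder phi: keeps s and discards the exogenous noise e."""
    return x // N_EXO

def rollout(n_steps, seed=0):
    """Collect observations and actions under a uniform random policy."""
    rng = np.random.default_rng(seed)
    s, e = 0, 0
    xs, acts = [], []
    for _ in range(n_steps):
        a = int(rng.integers(N_ACTIONS))
        xs.append(observe(s, e))
        acts.append(a)
        s = int(ENDO_NEXT[s, a])                # depends on the action
        e = int(rng.choice(N_EXO, p=EXO_P[e]))  # ignores the action
    return np.array(xs), np.array(acts)

def multistep_inverse_samples(xs, acts, k):
    """Triples (x_t, x_{t+k}, a_t): predict the first action from path endpoints."""
    return list(zip(xs[:-k], xs[k:], acts[:-k]))

def forward_samples(xs, acts, phi):
    """Triples (phi(x_t), a_t, phi(x_{t+1})) for fitting a latent forward model."""
    return list(zip(phi(xs[:-1]), acts[:-1], phi(xs[1:])))

xs, acts = rollout(500)
inv = multistep_inverse_samples(xs, acts, k=2)
fwd = forward_samples(xs, acts, true_encoder)
```

Under the ground-truth encoder, every latent pair (state, action) in `fwd` is followed by a single next latent state, which is the determinism property that a latent forward model can exploit. How the two sample types are weighted and trained in ACDF is not specified in this summary and is left out of the sketch.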

Paper Structure (38 sections, 8 theorems, 41 equations, 15 figures, 15 tables)

This paper contains 38 sections, 8 theorems, 41 equations, 15 figures, 15 tables.

Key Result

Lemma B.1

Consider any policy on $\mathcal{S}^*$ that assigns nonzero probability to all actions (i.e., any valid behavioral policy). Let $s, s' \in \mathcal{S}^*$ and $e, e' \in \mathcal{E}^*$. If $(s',e')$ is reachable from $(s,e)$, then $(s,e)$ is reachable from $(s',e')$. Consequentially, if $(s',e')$ is

Figures (15)

  • Figure 1: Probabilistic graphical model of the Ex-BMDP transition dynamics, as described in Section \ref{sec:exbmdp}. Endogenous states $s_t$ are shown as squares to indicate that they are deterministic functions of the previous endogenous states and actions. Observations $x_t$ and actions $a_t$ are shown in gray to indicate that they are observable. We do not show dependencies that may determine the actions $a_t$.
  • Figure 2: A tabular example where our proposed method ACDF successfully learns a control-endogenous state encoder, while the multistep-inverse method AC-State fails. (A) Full dynamics of the example Ex-BMDP: observed states are $\mathcal{X}=\{a, b, \ldots, j\}$ and actions are 'L' and 'R'. Transitions are stochastic: numbers in parentheses after action labels on transitions represent the probability of that transition, conditioned on the action. (B) Encoded latent states $\phi(x) \in \mathcal{S}$, where $\phi$ is the encoder learned using our proposed method, "ACDF." For example, $\phi$ maps the observed states $b$ and $g$ to the same latent state in $\mathcal{S}$. (C) Dynamics on the encoded latent states $\mathcal{S}$. The dynamics are deterministic, and capture the full agent-controllable factor of the state. Once $\phi$ is learned, these dynamics can be inferred from transition data by simple counting. The agent-independent exogenous dynamics are shown in the inset: these dynamics are not learned by our method. (D) Encoded latent states produced by the encoder $\phi$ output by the AC-State algorithm (Lamb et al., 2022). (E) The encoded latent states learned by AC-State are incorrect: the encoding conflates states with different forward dynamics, resulting in under-determined transitions between latent states.
  • Figure 3: A. Example of the witness distance $W(a,b)$. B-D. Witness distance can be greater than D, leading AC-State to fail. E-F. Witness distance can be infinite if the dynamics are periodic, which also leads to AC-State failures. (See text of Section \ref{sec:witness_dist}.)
  • Figure 4: Results of numerical simulation experiments. Four environments are tested, with the dynamics given in the first two columns. For each environment, $|\mathcal{X}| = 10$, and $\mathcal{X}$ is isomorphic to $\mathcal{S} \times \mathcal{E}$. In the last two columns, we show the success rate of each method (AC-State and ACDF) at learning the correct endogenous dynamics over 50 simulations. We show this success rate as a function of the hyperparameter $K$ and the number of environment steps used for learning.
  • Figure 5: Full Ex-BMDP model of the example in Section \ref{sec:no_indep}.
  • ...and 10 more figures
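
The Figure 2 caption notes that, once the encoder $\phi$ is learned, the latent dynamics can be inferred from transition data by simple counting. A minimal sketch of that counting step follows. The toy observations, the two encoders, and the function name `infer_latent_dynamics` are hypothetical and do not reproduce the dynamics of Figure 2. The second encoder deliberately conflates states with different forward dynamics, to show how under-determined latent transitions surface in the counts.

```python
from collections import Counter, defaultdict

def infer_latent_dynamics(phi, transitions):
    """Estimate latent dynamics (s, a) -> s' by counting encoded transitions.

    Returns the most frequent successor for each (s, a) pair, plus the set of
    pairs that were observed with more than one distinct successor.
    """
    counts = defaultdict(Counter)
    for x, a, x_next in transitions:
        counts[(phi(x), a)][phi(x_next)] += 1
    dynamics, ambiguous = {}, set()
    for key, successors in counts.items():
        dynamics[key] = successors.most_common(1)[0][0]
        if len(successors) > 1:
            ambiguous.add(key)
    return dynamics, ambiguous

# Hypothetical transition data: (observation, action, next observation).
transitions = [
    ("a", "R", "b"), ("c", "R", "d"),
    ("b", "L", "a"), ("d", "L", "c"),
    ("a", "L", "c"), ("b", "R", "d"),
]

# An encoder consistent with deterministic latent dynamics (assumed for this toy data).
good_phi = {"a": 0, "b": 1, "c": 0, "d": 1}.get
# An encoder that merges observations with different forward dynamics.
bad_phi = {"a": 0, "b": 0, "c": 0, "d": 1}.get

dyn, amb = infer_latent_dynamics(good_phi, transitions)
bad_dyn, bad_amb = infer_latent_dynamics(bad_phi, transitions)
```

With `good_phi`, every encoded (state, action) pair has a single successor, so `amb` is empty. With `bad_phi`, the pair `(0, "R")` is seen with two different successors, which is the kind of under-determined transition described in panel (E).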

Theorems & Definitions (15)

  • Lemma B.1
  • proof
  • Theorem D.1
  • proof
  • Theorem (Theorem of Schur)
  • Lemma D.2
  • proof
  • Proposition D.3
  • proof
  • Proposition D.4
  • ...and 5 more