Generalizing Multi-Step Inverse Models for Representation Learning to Finite-Memory POMDPs

Lili Wu, Ben Evans, Riashat Islam, Raihan Seraj, Yonathan Efroni, Alex Lamb

TL;DR

The paper tackles learning agent-centric representations in finite-memory POMDPs by extending multi-step inverse models. It introduces Masked Inverse Kinematics with Actions (MIK+A), provides Bayes-optimal characterizations for several inverse objectives, and proves that MIK+A can recover the full agent-centric state under past- and future-decodability with rollouts up to length $m+n+D$. The theory is complemented by experiments in partially observed navigation and offline visual RL, where MIK+A consistently outperforms baselines and is robust to partial observability and exogenous noise. This work advances reward-free representation learning in non-Markovian, high-dimensional environments and supports improved downstream task performance through reliable latent state discovery.

Abstract

Discovering an informative, or agent-centric, state representation that encodes only the relevant information while discarding the irrelevant is a key challenge towards scaling reinforcement learning algorithms and efficiently applying them to downstream tasks. Prior works studied this problem in high-dimensional Markovian environments, when the current observation may be a complex object but is sufficient to decode the informative state. In this work, we consider the problem of discovering the agent-centric state in the more challenging high-dimensional non-Markovian setting, when the state can be decoded from a sequence of past observations. We establish that generalized inverse models can be adapted for learning agent-centric state representation for this task. Our results include asymptotic theory in the deterministic dynamics setting as well as counter-examples for alternative intuitive algorithms. We complement these findings with a thorough empirical study on the agent-centric state discovery abilities of the different alternatives we put forward. Particularly notable is our analysis of past actions, where we show that these can be a double-edged sword: making the algorithms more successful when used correctly and causing dramatic failure when used incorrectly.

Paper Structure (24 sections, 1 theorem, 12 equations, 14 figures, 7 tables)

This paper contains 24 sections, 1 theorem, 12 equations, 14 figures, 7 tables.

Key Result

Lemma 1

Let $\mu$ be the initial distribution. Assume that the agent-centric and exogenous parts are decoupled in the initial distribution, $\mu(s,\xi) =\mu(s)\mu(\xi)$, and that $\pi$ is an endogenous policy. Then, for any $t\geq 1$ it holds that $\mathbb{P}_\pi(o' \mid o,a,h) = q(o' \mid s',\xi') \mathbb{P

Figures (14)

  • Figure 1: We examine several objectives for generalizing inverse kinematics to FM-POMDPs. MIK+A uses past-decodability and future-decodability with a gap of $k$ masked steps, FJ+A uses past-decodability with a gap of $k$ steps, while AH uses past-decodability over the entire sequence.
  • Figure 2: The Forward Jump objective fails in a counterexample where the observation can only be seen once every $m$ steps, preventing the use of $k \leq m$ inverse kinematics examples, whereas the inverse examples with $k>m$ provide no signal for separating the states.
  • Figure 3: Visualization of the four navigation environments. From left to right: no curtain, one curtain, three curtains, and first-person environments. All include some degree of partial observability.
  • Figure 4: We compare state estimation performance (higher is better) across our various proposed methods. We compare action-conditioned and action-free variants while also considering a self-prediction auxiliary loss and the maximum prediction span $K$. We omit FJ and FJ+A in the maximum $K=1$ case because of equivalence to AH and AH+A with a shorter history.
  • Figure 5: Illustration of the visual offline RL experiment setup, in presence of partial observability. We use a forward and backward sequence model (RNN encoder) to handle past and future observation sequences, to achieve latent state discovery in FM-POMDPs.
  • ...and 9 more figures
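
The Figure 1 caption distinguishes the three objectives by which observations are fed to the action predictor. The sketch below makes that difference concrete by computing the observation index windows each objective would use to predict the action at time $t$. It is only an illustration: the function names, the inclusive-index convention, and the window lengths `m` (past) and `n` (future) are assumptions made for this example and are not taken from the paper's code.

```python
from typing import Tuple

# Inclusive (start, end) time indices into an observation sequence.
Window = Tuple[int, int]


def mik_windows(t: int, k: int, m: int, n: int) -> Tuple[Window, Window]:
    """Masked inverse kinematics (assumed convention).

    Past window: the last m observations ending at t.
    Future window: n observations starting at t + k.
    Observations strictly between t and t + k are masked (unused).
    """
    past = (max(0, t - m + 1), t)
    future = (t + k, t + k + n - 1)
    return past, future


def fj_windows(t: int, k: int, m: int) -> Tuple[Window, Window]:
    """Forward jump (assumed convention).

    Both windows are past-looking: m observations ending at t,
    and m observations ending at t + k.
    """
    first = (max(0, t - m + 1), t)
    second = (max(0, t + k - m + 1), t + k)
    return first, second


def ah_windows(t: int, k: int) -> Tuple[Window, Window]:
    """All history: every observation up to t, and every one up to t + k."""
    return (0, t), (0, t + k)


if __name__ == "__main__":
    t, k, m, n = 5, 3, 2, 2
    print("MIK:", mik_windows(t, k, m, n))
    print("FJ: ", fj_windows(t, k, m))
    print("AH: ", ah_windows(t, k))
```

Under these assumed conventions, only the MIK windows leave a gap between the two inputs and use observations after $t+k$. The FJ windows overlap whenever $k < m$, and the AH windows always overlap. The "+A" variants would additionally condition on past actions, which this index-only sketch does not model.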

Theorems & Definitions (1)

  • Lemma 1: Decoupling property for endogenous policies (Efroni et al., 2022)