Generalizing Multi-Step Inverse Models for Representation Learning to Finite-Memory POMDPs

Lili Wu; Ben Evans; Riashat Islam; Raihan Seraj; Yonathan Efroni; Alex Lamb

TL;DR

The paper tackles learning agent-centric representations in finite-memory POMDPs by extending multi-step inverse models. It introduces Masked Inverse Kinematics with Actions (MIK+A), characterizes the Bayes-optimal classifier for several candidate inverse objectives, and proves that MIK+A can recover the full agent-centric state under past- and future-decodability with rollouts up to length $m+n+D$. Theoretical results are complemented by experiments in partially observed navigation and offline visual RL, where MIK+A consistently outperforms baselines and demonstrates robustness to partial observability and exogenous noise. This work advances reward-free representation learning in non-Markovian, high-dimensional environments and supports improved downstream task performance through reliable latent state discovery.

Abstract

Discovering an informative, or agent-centric, state representation that encodes only the relevant information while discarding the irrelevant is a key challenge towards scaling reinforcement learning algorithms and efficiently applying them to downstream tasks. Prior works studied this problem in high-dimensional Markovian environments, where the current observation may be a complex object but is sufficient to decode the informative state. In this work, we consider the problem of discovering the agent-centric state in the more challenging high-dimensional non-Markovian setting, where the state can be decoded from a sequence of past observations. We establish that generalized inverse models can be adapted for learning agent-centric state representations for this task. Our results include asymptotic theory in the deterministic dynamics setting as well as counter-examples for alternative intuitive algorithms. We complement these findings with a thorough empirical study on the agent-centric state discovery abilities of the different alternatives we put forward. Particularly notable is our analysis of past actions, where we show that these can be a double-edged sword: making the algorithms more successful when used correctly and causing dramatic failure when used incorrectly.

Paper Structure (24 sections, 1 theorem, 12 equations, 14 figures, 7 tables)

Introduction
Background and Preliminaries
Proposed Objectives
The Bayes Optimal Classifier of Candidate Objectives
Discovering the Complete Agent-Centric State
Experimental Results
Discovering State from Partially-Observed Navigation Environments
Visual Offline RL with Additional Partial Observability
Related Work
Discussion
Broader Impact
Theory Details
Structural Lemma
Proof that the MIK+A objective has the right Bayes optimal classifier
Proof that All-History (AH) reduces to one-step inverse model
...and 9 more sections

Key Result

Lemma 1

Let $\mu$ be the initial distribution. Assume that the agent-centric and exogenous parts are decoupled in the initial distribution, $\mu(s,\xi) = \mu(s)\mu(\xi)$, and that $\pi$ is an endogenous policy. Then, for any $t \geq 1$, it holds that $\mathbb{P}_\pi(o' \mid o,a,h) = q(o' \mid s',\xi') \mathbb{P}\dots$

Figures (14)

Figure 1: We examine several objectives for generalizing inverse kinematics to FM-POMDPs. MIK+A uses past-decodability and future-decodability with a gap of $k$ masked steps, FJ+A uses past-decodability with a gap of $k$ steps, while AH uses past-decodability over the entire sequence.
Figure 2: The Forward Jump objective fails in a counterexample where the observation can only be seen once every $m$ steps, preventing the use of $k \leq m$ inverse kinematics examples, whereas the inverse examples with $k>m$ provide no signal for separating the states.
Figure 3: Visualization of the four navigation environments. From left to right: no curtain, one curtain, three curtains, and first-person environments. All include some degree of partial observability.
Figure 4: We compare state estimation performance (higher is better) across our various proposed methods. We compare action-conditioned and action-free variants while also considering a self-prediction auxiliary loss and the maximum prediction span $K$. We omit FJ and FJ+A in the maximum $K=1$ case because of equivalence to AH and AH+A with a shorter history.
Figure 5: Illustration of the visual offline RL experiment setup, in presence of partial observability. We use a forward and backward sequence model (RNN encoder) to handle past and future observation sequences, to achieve latent state discovery in FM-POMDPs.
...and 9 more figures
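
The difference between the objectives in the Figure 1 caption comes down to which observations are fed to the classifier that predicts the action $a_t$. The sketch below only does that index bookkeeping. It is an illustration under stated assumptions, not the paper's definitions: the function names, the 0-indexed time convention, the window lengths `m` (past) and `n` (future), and in particular the choice that FJ's second window is the `m` observations ending at `t + k` are all hypothetical.

```python
from typing import NamedTuple


class Windows(NamedTuple):
    """Observation indices fed to an inverse model that predicts a_t."""
    first: range   # observations summarizing the state at time t
    second: range  # observations summarizing the later state


def ah_windows(t: int, k: int) -> Windows:
    # All-History (AH): the entire sequence up to t and up to t + k.
    return Windows(range(0, t + 1), range(0, t + k + 1))


def fj_windows(t: int, k: int, m: int) -> Windows:
    # Forward Jump (FJ): a past window of length m ending at t, and
    # (assumption) a past window of length m ending at t + k.
    return Windows(
        range(max(0, t - m + 1), t + 1),
        range(max(0, t + k - m + 1), t + k + 1),
    )


def mik_windows(t: int, k: int, m: int, n: int) -> Windows:
    # Masked Inverse Kinematics (MIK): a past window of length m ending
    # at t and a future window of length n starting at t + k. The
    # observations strictly between t and t + k are left out (masked).
    return Windows(
        range(max(0, t - m + 1), t + 1),
        range(t + k, t + k + n),
    )


if __name__ == "__main__":
    t, k, m, n = 5, 3, 2, 2
    for name, w in [
        ("AH", ah_windows(t, k)),
        ("FJ", fj_windows(t, k, m)),
        ("MIK", mik_windows(t, k, m, n)),
    ]:
        print(name, list(w.first), list(w.second))
```

Under these assumptions, the "+A" variants would additionally pass the actions that fall inside each window to the model, excluding the target $a_t$ itself. Note that AH's second window contains every observation after `t`, which is consistent with the outline's section on AH reducing to a one-step inverse model.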

Theorems & Definitions (1)

Lemma 1: Decoupling property for endogenous policies (Efroni et al., 2022)
