Generalizing Multi-Step Inverse Models for Representation Learning to Finite-Memory POMDPs
Lili Wu, Ben Evans, Riashat Islam, Raihan Seraj, Yonathan Efroni, Alex Lamb
TL;DR
The paper tackles learning agent-centric representations in finite-memory POMDPs by extending multi-step inverse models. It introduces Masked Inverse Kinematics with Actions (MIK+A), provides Bayes-optimal characterizations for several inverse objectives, and proves that MIK+A can recover the full agent-centric state under past- and future-decodability with rollouts up to length $m+n+D$. Experiments in partially observed navigation and offline visual RL complement the theory: MIK+A consistently outperforms baselines and remains robust to partial observability and exogenous noise. The result is a reward-free representation learning method for non-Markovian, high-dimensional environments whose recovered latent state supports downstream tasks.
Abstract
Discovering an informative, or agent-centric, state representation that encodes only the relevant information while discarding the irrelevant is a key challenge towards scaling reinforcement learning algorithms and efficiently applying them to downstream tasks. Prior work studied this problem in high-dimensional Markovian environments, where the current observation may be a complex object but is sufficient to decode the informative state. In this work, we consider the problem of discovering the agent-centric state in the more challenging high-dimensional non-Markovian setting, where the state can be decoded from a sequence of past observations. We establish that generalized inverse models can be adapted to learn agent-centric state representations in this setting. Our results include asymptotic theory in the deterministic dynamics setting as well as counter-examples for alternative intuitive algorithms. We complement these findings with a thorough empirical study of the agent-centric state discovery abilities of the different alternatives we put forward. Particularly notable is our analysis of past actions, where we show that these can be a double-edged sword: making the algorithms more successful when used correctly and causing dramatic failure when used incorrectly.
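
To make the idea of a multi-step inverse objective over observation sequences more concrete, the following is a minimal sketch of how one training example might be assembled from a logged trajectory. It is an illustration under stated assumptions, not the authors' implementation: the names `InverseExample` and `make_example`, the indexing convention, and the reading of $m$ and $n$ as the lengths of the past and future observation windows are all hypothetical. The target is the action $a_t$; the inputs are a window of past observations ending at time $t$, a window of future observations starting $k$ steps later, and optionally the past actions inside the past window.

```python
from typing import Any, NamedTuple, Sequence, Tuple


class InverseExample(NamedTuple):
    """One (hypothetical) training example for a multi-step inverse model."""
    past_obs: Tuple[Any, ...]      # o_{t-m+1}, ..., o_t
    past_actions: Tuple[Any, ...]  # a_{t-m+1}, ..., a_{t-1}; empty if unused
    future_obs: Tuple[Any, ...]    # o_{t+k}, ..., o_{t+k+n-1}
    k: int                         # gap between time t and the future window
    target_action: Any             # a_t, the action the model should predict


def make_example(
    obs: Sequence[Any],
    actions: Sequence[Any],
    t: int,
    k: int,
    m: int,
    n: int,
    use_past_actions: bool = True,
) -> InverseExample:
    """Slice a trajectory into inputs and target for an inverse model.

    Assumed convention: actions[i] is the action taken after observing obs[i].
    """
    if k < 1 or m < 1 or n < 1:
        raise ValueError("k, m and n must all be at least 1")
    if t - m + 1 < 0 or t + k + n > len(obs) or t >= len(actions):
        raise IndexError("requested windows fall outside the trajectory")

    past_obs = tuple(obs[t - m + 1 : t + 1])
    future_obs = tuple(obs[t + k : t + k + n])
    # Assumption of this sketch: the action window stops before a_t, so the
    # prediction target never appears among the inputs.
    past_actions = tuple(actions[t - m + 1 : t]) if use_past_actions else ()
    return InverseExample(past_obs, past_actions, future_obs, k, actions[t])
```

A model trained on such examples would encode `past_obs` (with or without `past_actions`) and `future_obs` and classify `target_action`. Toggling `use_past_actions` is one simple way to compare variants with and without action inputs.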
