Provable Representation with Efficient Planning for Partially Observable Reinforcement Learning
Hongming Zhang, Tongzheng Ren, Chenjun Xiao, Dale Schuurmans, Bo Dai
TL;DR
The paper tackles reinforcement learning under partial observability, focusing on POMDPs with an $L$-decodable structure. It introduces the Multi-step Latent Variable Representation ($\mu$LV-Rep), under which $Q^\pi$ is linear in $p(\cdot|x_h,a_h)$, so planning is tractable without explicit belief computation. The authors develop a variational latent-variable learning approach to fit $p(\cdot|x_h,a_h)$, combine it with planning (e.g., SAC for continuous actions) and exploration bonuses, and prove PAC-like sample complexity guarantees under realizability and RKHS assumptions. Empirically, $\mu$LV-Rep outperforms strong baselines on visual, partially observable robotic tasks (Meta-World and partially observable MuJoCo), approaching or surpassing fully observed performance on many tasks. The work demonstrates that carefully structured latent representations can make RL in partially observable environments both provably efficient and practical.
Abstract
In most real-world reinforcement learning applications, state information is only partially observable, which breaks the Markov decision process assumption and leads to inferior performance for algorithms that conflate observations with state. Partially Observable Markov Decision Processes (POMDPs), on the other hand, provide a general framework that allows partial observability to be accounted for in learning, exploration, and planning, but present significant computational and statistical challenges. To address these difficulties, we develop a representation-based perspective that leads to a coherent framework and a tractable algorithmic approach for practical reinforcement learning from partial observations. We provide a theoretical analysis justifying the statistical efficiency of the proposed algorithm, and also empirically demonstrate that the proposed algorithm can surpass state-of-the-art performance with partial observations across various benchmarks, advancing reliable reinforcement learning towards more practical applications.
