Near-Optimal Partially Observable Reinforcement Learning with Partial Online State Information
Ming Shi, Yingbin Liang, Ness B. Shroff
TL;DR
This paper investigates learning in partially observable MDPs (POMDPs) when the learner has access only to partial online state information (POSI). It proves a fundamental hardness result: without full online state information (OSI), learning an $\epsilon$-optimal policy can require sample complexity that is exponential in the horizon. It then identifies two tractable POSI subclasses and develops algorithms with provable sublinear regret, notably $\tilde{O}(\sqrt{K})$-type regret in the number of episodes $K$ that improves with the amount of information exposed ($\tilde{d}$). The work introduces a query-aware operator framework and a two-layer learning architecture (PDOL/OMLE-POSI) that jointly optimize information acquisition and control under POSI, and it clarifies when POSI suffices for efficient reinforcement learning in POMDPs. These results provide principled guidance for jointly designing sensing (POSI queries) and control in real-world systems with sensing constraints, such as wireless networks and autonomous robotics.
Abstract
Partially observable Markov decision processes (POMDPs) are a general framework for sequential decision-making under latent state uncertainty, yet learning in POMDPs is intractable in the worst case. Motivated by sensing and probing constraints in practice, we study how much online state information (OSI) is sufficient to enable efficient learning. We formalize a model in which the learner can query only partial OSI (POSI) during interaction. We first prove an information-theoretic hardness result showing that, for general POMDPs, finding an $\epsilon$-optimal policy can require sample complexity that is exponential in the horizon unless full OSI is available. We then identify two structured subclasses that remain learnable under POSI and propose corresponding algorithms with provably efficient performance guarantees. In particular, we establish regret upper bounds with $\tilde{O}(\sqrt{K})$ dependence on the number of episodes $K$, together with complementary lower bounds, thereby delineating when POSI suffices for efficient reinforcement learning. Our results highlight a principled separation between intractable and tractable regimes under incomplete online state access and provide new tools for jointly optimizing POSI queries and control actions.
