Near-Optimal Partially Observable Reinforcement Learning with Partial Online State Information

Ming Shi; Yingbin Liang; Ness B. Shroff

TL;DR

This paper investigates learning in partially observable MDPs when the learner has access to partial online state information (POSI). It proves a fundamental hardness result: without full OSI, learning an $\epsilon$-optimal policy can require exponential sample complexity in the horizon. It then identifies two tractable POSI subclasses and develops algorithms with provable sublinear regret, notably achieving $\tilde{O}(\sqrt{K})$-type regret that improves with the amount of information exposed ($\tilde{d}$). The work introduces a query-aware operator framework and a two-layer learning architecture (PDOL/OMLE-POSI) that jointly optimize information acquisition and control under POSI and, in doing so, clarifies when POSI suffices to yield efficient reinforcement learning in POMDPs. These results provide principled guidance for jointly designing sensing (POSI queries) and control in real-world systems with sensing constraints, such as wireless networks and autonomous robotics.

Abstract

Partially observable Markov decision processes (POMDPs) are a general framework for sequential decision-making under latent state uncertainty, yet learning in POMDPs is intractable in the worst case. Motivated by sensing and probing constraints in practice, we study how much online state information (OSI) is sufficient to enable efficient learning guarantees. We formalize a model in which the learner can query only partial OSI (POSI) during interaction. We first prove an information-theoretic hardness result showing that, for general POMDPs, achieving an $\epsilon$-optimal policy can require sample complexity that is exponential in the horizon unless full OSI is available. We then identify two structured subclasses that remain learnable under POSI and propose corresponding algorithms with provably efficient performance guarantees. In particular, we establish regret upper bounds with $\tilde{O}(\sqrt{K})$ dependence on the number of episodes $K$, together with complementary lower bounds, thereby delineating when POSI suffices for efficient reinforcement learning. Our results highlight a principled separation between intractable and tractable regimes under incomplete online state access and provide new tools for jointly optimizing POSI queries and learning control actions.
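As a rough illustration of what querying partial OSI can look like, the sketch below assumes the latent state factors into $d$ sub-states drawn from a common finite set, and that the learner reveals $\tilde{d} < d$ of them per step. This factorization is an assumption suggested by the $(d, \tilde{d})$ parameters of Theorem 1, not the paper's exact model, and the names `query_posi` and `consistent_states` are hypothetical.

```python
import random
from itertools import product


def query_posi(state, query):
    """Reveal only the queried sub-states (hypothetical POSI query)."""
    return {i: state[i] for i in query}


def consistent_states(revealed, d, sub_states):
    """Enumerate every full state that agrees with the revealed sub-states."""
    hidden = [i for i in range(d) if i not in revealed]
    for values in product(sub_states, repeat=len(hidden)):
        full = dict(revealed)
        full.update(zip(hidden, values))
        yield tuple(full[i] for i in range(d))


# Assumed toy sizes: d sub-states, d_tilde of them queried per step.
d, d_tilde = 4, 2
sub_states = (0, 1, 2)

rng = random.Random(0)
state = tuple(rng.choice(sub_states) for _ in range(d))

query = (0, 3)  # indices the learner chooses to observe
revealed = query_posi(state, query)
candidates = list(consistent_states(revealed, d, sub_states))

print(f"revealed sub-states: {revealed}")
print(f"full states still consistent with the query: {len(candidates)}")
```

In this toy setting the number of states left unresolved by a query grows as the sub-state set size raised to the power $d - \tilde{d}$, which gives some intuition for why guarantees can improve as $\tilde{d}$ increases.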

Paper Structure (122 sections, 15 theorems, 170 equations, 1 figure, 1 table, 3 algorithms)

Introduction
Contributions
Related Work on POMDPs
Problem Formulation
Traditional Episodic POMDP
Partial Online State Information (POSI)
Motivating example 1 (Wireless channel scheduling)
Motivating example 2 (Autonomous driving)
Performance Metric
Fundamental Hardness Without Full Online State Information
Fundamental Tractability under POSI and Partial Noisy Observations
Connection to Multi-Step Weakly Revealing POMDPs and POSI-Specific Novelties
Connection to bounded-delay (two-step/multi-step) revealing POMDPs via augmentation (queries as actions)
POSI-specific novelties beyond classical revealing POMDPs
A Provably Efficient Algorithm
...and 107 more sections

Key Result

Theorem 1

(Intractability without full OSI). Fix any $H\ge 2$, $A\ge 2$, and any $(d,\tilde{d})$ with $1\le \tilde{d}<d$. For the POSI model in subsec:formulatestructuredstate, there exist instances such that the following holds: for any (possibly randomized) learning algorithm that, after $K$ episodes, outpu

Figures (1)

Figure 1: One-step interaction sketch for the traditional POMDP and the two POSI subclasses studied in this paper

Theorems & Definitions (36)

Remark 1
Theorem 1
Remark 2
Proof sketch
Theorem 2
Proof sketch of \ref{theorem:regretpomle}
Theorem 3: Lower bound with explicit $\tilde{d}$-dependence
Proof sketch (why $\vert \tilde{\mathbb{S}} \vert^{(d-\tilde{d})/2}$ is unavoidable)
Remark 3: On the role of the revealing/conditioning parameter $\alpha$
Theorem 4: Lower bound for unavoidable dependence on $d$ (via tabular MDPs)
...and 26 more
