RL in Latent MDPs is Tractable: Online Guarantees via Off-Policy Evaluation

Jeongyeol Kwon; Shie Mannor; Constantine Caramanis; Yonathan Efroni

TL;DR

This work tackles online reinforcement learning in Latent MDPs (LMDPs), where an unseen latent context selects one of M MDPs at episode start. The authors introduce LMDP-OMLE, an optimistic, sample-efficient online algorithm built around a novel off-policy evaluation (OPE) framework tailored for LMDPs and a latent coverage coefficient that quantifies how well a segmented policy covers a target policy across latent contexts. A key theoretical contribution is a TV-distance bound for trajectory distributions under LMDPs expressed via the segmented-policy coverage, together with a sufficiency result that restricts attention to memoryless policies, enabling a practical coverage-doubling argument. The resulting analysis yields near-optimal guarantees up to polynomial factors and argues that OPE-based perspectives can generalize exploration methods beyond LMDPs to broader partially observed interactive learning problems.

Abstract

In many real-world decision problems there is partially observed, hidden or latent information that remains fixed throughout an interaction. Such decision problems can be modeled as Latent Markov Decision Processes (LMDPs), where a latent variable is selected at the beginning of an interaction and is not disclosed to the agent. In the last decade, there has been significant progress in solving LMDPs under different structural assumptions. However, for general LMDPs, there is no known learning algorithm that provably matches the existing lower bound (Kwon et al., 2021). We introduce the first sample-efficient algorithm for LMDPs without any additional structural assumptions. Our result builds on a new perspective on the role of off-policy evaluation guarantees and coverage coefficients in LMDPs, a perspective that has been overlooked in the context of exploration in partially observed environments. Specifically, we establish a novel off-policy evaluation lemma and introduce a new coverage coefficient for LMDPs. Then, we show how these can be used to derive near-optimal guarantees for an optimistic exploration algorithm. We believe these results can be valuable for a wide range of interactive learning problems beyond LMDPs, especially for partially observed environments.
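The LMDP interaction protocol described above can be sketched as a small simulator. This is only an illustrative sketch, not the paper's code: the names `TabularMDP`, `LatentMDP`, and `run_episode`, the tabular representation, the deterministic rewards, and the fixed initial state are all assumptions made for brevity.

```python
import random
from dataclasses import dataclass


@dataclass
class TabularMDP:
    # transitions[s][a] is a list of next-state probabilities (hypothetical layout)
    transitions: list
    # rewards[s][a] is a deterministic reward (an assumption for brevity)
    rewards: list


@dataclass
class LatentMDP:
    mdps: list       # the M component MDPs
    weights: list    # mixing weights over latent contexts
    horizon: int
    init_state: int = 0

    def run_episode(self, policy, rng):
        # The latent context is drawn once per episode and stays fixed.
        m = rng.choices(range(len(self.mdps)), weights=self.weights)[0]
        mdp = self.mdps[m]
        s = self.init_state
        trajectory = []
        for t in range(self.horizon):
            # The policy sees only the time step, state, and observed history.
            a = policy(t, s, trajectory)
            r = mdp.rewards[s][a]
            probs = mdp.transitions[s][a]
            s_next = rng.choices(range(len(probs)), weights=probs)[0]
            trajectory.append((s, a, r))
            s = s_next
        # The latent index m is deliberately not returned to the agent.
        return trajectory


# Two MDPs that share dynamics but differ in which action is rewarded in state 0.
dynamics = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
mdp_a = TabularMDP(transitions=dynamics, rewards=[[0.0, 1.0], [0.0, 0.0]])
mdp_b = TabularMDP(transitions=dynamics, rewards=[[1.0, 0.0], [0.0, 0.0]])
lmdp = LatentMDP(mdps=[mdp_a, mdp_b], weights=[0.5, 0.5], horizon=3)

# A memoryless policy that always plays action 0.
traj = lmdp.run_episode(lambda t, s, history: 0, random.Random(0))
```

Because the latent context is fixed within an episode, the rewards in `traj` are consistent with a single component MDP, even though the agent never observes which one was drawn.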

Paper Structure (50 sections, 12 theorems, 103 equations, 1 figure, 1 table, 2 algorithms)

Introduction
Technical Challenges
Challenge 1: Limitation of Existing POMDP Algorithms.
Challenge 2: Limitation of Existing LMDP Algorithms.
Challenge 3: Limitation of Existing Complexity Measures in RL.
Overview of Our Contribution
Preliminaries
Notation
New Perspective on OMLE: Online Guarantees via Off-Policy Evaluation
Exploration in LMDPs via Sufficient Coverage
Intuition from the moment-exploration algorithm of Kwon et al. (2023).
Off-Policy Evaluation in LMDPs
Coverage Doubling via Sufficiency of Memoryless Policies
The LMDP-OMLE Algorithm
Proof Sketch
...and 35 more sections

Key Result

Lemma 3.1

For any behavioral and target policies $\psi,\pi \in \Pi$, a coverage coefficient of $\psi$ with respect to $\pi$ is defined. Then, for any two models $\theta,\theta^* \in \Theta$, the TV distance between the trajectory distributions induced by the target policy $\pi \in \Pi$ is bounded in terms of this coverage coefficient.
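For reference, the quantity being bounded is the standard total variation distance between two trajectory distributions. The notation $\mathbb{P}^{\pi}_{\theta}(\tau)$ for the probability of trajectory $\tau$ under model $\theta$ and policy $\pi$ is an assumption here and may differ from the paper's own symbols:

$$\mathrm{TV}\big(\mathbb{P}^{\pi}_{\theta}, \mathbb{P}^{\pi}_{\theta^*}\big) = \frac{1}{2}\sum_{\tau}\big|\mathbb{P}^{\pi}_{\theta}(\tau) - \mathbb{P}^{\pi}_{\theta^*}(\tau)\big|$$

Under this convention the distance lies in $[0, 1]$ and is zero exactly when the two models induce the same distribution over trajectories under $\pi$.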

Figures (1)

Figure 1: High-level description of LMDP-OMLE. In the online phase, we find a new test policy under which models in the confidence set do not agree. Then the exploration policy is constructed with our new notion of segmentation of policies within $\Psi_{\texttt{test}}$ that are executed throughout. In the offline phase, we add the batched sample trajectories to the dataset and update the confidence set of models.

Theorems & Definitions (16)

Definition 2.1: Latent Markov Decision Process (LMDP)
Lemma 3.1: TV Bound via OPE for MDPs
Lemma 3.2: Coverage Multiplicative Increase
Theorem 3.3
Definition 4.1: LMDP Coverage Coefficient
Lemma 4.2: TV Bound via OPE for LMDPs
Remark 4.3: Why is single latent-state coverability coefficient not enough?
Lemma 4.4: Sufficiency of Memoryless Policies for LMDPs
Theorem 4.5: Sample Complexity of LMDP-OMLE
Lemma 5.1
...and 6 more
