RL in Latent MDPs is Tractable: Online Guarantees via Off-Policy Evaluation

Jeongyeol Kwon, Shie Mannor, Constantine Caramanis, Yonathan Efroni

TL;DR

This work tackles online reinforcement learning in Latent MDPs (LMDPs), where an unseen latent context selects one of M MDPs at episode start. The authors introduce LMDP-OMLE, an optimistic, sample-efficient online algorithm built around a novel off-policy evaluation (OPE) framework tailored for LMDPs and a latent coverage coefficient that quantifies how well a segmented policy covers a target policy across latent contexts. A key theoretical contribution is a TV-distance bound for trajectory distributions under LMDPs expressed via the segmented-policy coverage, together with a sufficiency result that restricts attention to memoryless policies, enabling a practical coverage-doubling argument. The resulting analysis yields near-optimal guarantees up to polynomial factors and argues that OPE-based perspectives can generalize exploration methods beyond LMDPs to broader partially observed interactive learning problems.

Abstract

In many real-world decision problems there is partially observed, hidden or latent information that remains fixed throughout an interaction. Such decision problems can be modeled as Latent Markov Decision Processes (LMDPs), where a latent variable is selected at the beginning of an interaction and is not disclosed to the agent. In the last decade, there has been significant progress in solving LMDPs under different structural assumptions. However, for general LMDPs, there is no known learning algorithm that provably matches the existing lower bound (Kwon et al., 2021). We introduce the first sample-efficient algorithm for LMDPs without any additional structural assumptions. Our result builds on a new perspective on the role of off-policy evaluation guarantees and coverage coefficients in LMDPs, a perspective that has been overlooked in the context of exploration in partially observed environments. Specifically, we establish a novel off-policy evaluation lemma and introduce a new coverage coefficient for LMDPs. We then show how these can be used to derive near-optimal guarantees for an optimistic exploration algorithm. We believe these results can be valuable for a wide range of interactive learning problems beyond LMDPs, especially for partially observed environments.
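
The interaction protocol described above (a latent variable drawn once per episode and never shown to the agent) can be illustrated with a small simulator. This is only a minimal sketch under simplifying assumptions: the class `TabularLMDP`, the helper `run_episode`, deterministic rewards, and the two-context toy instance are all hypothetical names and choices made for illustration, not the paper's code or notation.

```python
import random


class TabularLMDP:
    """Toy latent MDP: M tabular MDPs over shared states/actions, one drawn per episode."""

    def __init__(self, weights, transitions, rewards, horizon, seed=0):
        # weights[m]: probability that latent context m is drawn at episode start
        # transitions[m][s][a]: next-state probabilities under context m
        # rewards[m][s][a]: reward under context m (deterministic here, an assumption)
        self.weights = weights
        self.transitions = transitions
        self.rewards = rewards
        self.horizon = horizon
        self.rng = random.Random(seed)
        self._context = None  # hidden from the agent
        self._state = None
        self._t = 0

    def reset(self, initial_state=0):
        # The latent context is sampled once and then stays fixed for the episode.
        contexts = range(len(self.weights))
        self._context = self.rng.choices(contexts, weights=self.weights)[0]
        self._state = initial_state
        self._t = 0
        return self._state

    def step(self, action):
        m, s = self._context, self._state
        reward = self.rewards[m][s][action]
        probs = self.transitions[m][s][action]
        self._state = self.rng.choices(range(len(probs)), weights=probs)[0]
        self._t += 1
        return self._state, reward, self._t >= self.horizon


def run_episode(env, policy):
    """Roll out a policy that maps the observed history to an action."""
    history = [env.reset()]
    total, done = 0.0, False
    while not done:
        action = policy(history)
        state, reward, done = env.step(action)
        history += [action, reward, state]  # the context never enters the history
        total += reward
    return total, history


# Two contexts over 2 states and 2 actions that disagree on which action pays off.
stay_or_swap = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]  # [s][a] -> probs
env = TabularLMDP(
    weights=[0.5, 0.5],
    transitions=[stay_or_swap, stay_or_swap],
    rewards=[[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]],
    horizon=4,
)
total, history = run_episode(env, lambda h: 0)  # always play action 0
```

In this toy instance, always playing action 0 earns the full return in one context and nothing in the other, so the agent can only do well by inferring the hidden context from the rewards it observes within the episode.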

Paper Structure

This paper contains 50 sections, 12 theorems, 103 equations, 1 figure, 1 table, 2 algorithms.

Key Result

Lemma 3.1

For any behavioral and target policies $\psi,\pi \in \Pi$, let the coverage coefficient be defined by: [equation missing in source]. Then, for any two models $\theta,\theta^* \in \Theta$, the TV distance between the trajectory distributions induced by a target policy $\pi \in \Pi$ is bounded as follows: [equation missing in source].
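
For reference, the total variation (TV) distance named in the lemma is the standard quantity below. This is only the textbook definition, not the lemma's bound; the notation $\mathbb{P}^{\pi}_{\theta}(\tau)$ for the probability of a trajectory $\tau$ under model $\theta$ and policy $\pi$ is an assumption made here for illustration, and some authors omit the factor $\tfrac{1}{2}$.

$$
\mathrm{TV}\left(\mathbb{P}^{\pi}_{\theta}, \mathbb{P}^{\pi}_{\theta^*}\right) = \frac{1}{2} \sum_{\tau} \left| \mathbb{P}^{\pi}_{\theta}(\tau) - \mathbb{P}^{\pi}_{\theta^*}(\tau) \right|
$$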

Figures (1)

  • Figure 1: High-level description of LMDP-OMLE. In the online phase, we find a new test policy under which models in the confidence set do not agree. The exploration policy is then constructed, using our new notion of policy segmentation, from the policies in $\Psi_{\texttt{test}}$, which are executed throughout. In the offline phase, we add the batched sample trajectories to the dataset and update the confidence set of models.
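
The alternation between the online and offline phases in the Figure 1 caption can be sketched as a generic loop. This is a loose skeleton under stated assumptions, not the authors' LMDP-OMLE: the function `explore_then_refit`, its four callback arguments, the `max_rounds` cap, and the stopping rule (stop once no test policy separates the remaining models) are all hypothetical, and the segmentation of test policies is hidden inside the `build_exploration_policy` callback.

```python
def explore_then_refit(models, find_disagreement_policy, build_exploration_policy,
                       collect_batch, update_confidence_set, max_rounds):
    """Alternate between finding a disagreement policy and refitting the model set.

    All callbacks are placeholders supplied by the caller (assumptions of this sketch):
      find_disagreement_policy(confidence_set) -> a test policy, or None if models agree
      build_exploration_policy(test_policies)  -> a policy used to gather data
      collect_batch(exploration_policy)        -> list of sampled trajectories
      update_confidence_set(models, dataset)   -> models still consistent with the data
    """
    confidence_set = list(models)
    test_policies, dataset = [], []
    for _ in range(max_rounds):
        # Online phase: look for a policy on which surviving models disagree.
        test_policy = find_disagreement_policy(confidence_set)
        if test_policy is None:
            break  # assumed stopping rule: nothing left to distinguish
        test_policies.append(test_policy)
        exploration_policy = build_exploration_policy(test_policies)
        # Offline phase: add the new batch and recompute the confidence set.
        dataset.extend(collect_batch(exploration_policy))
        confidence_set = update_confidence_set(models, dataset)
    return confidence_set, dataset, test_policies
```

Passing the pieces in as callbacks keeps the sketch agnostic about how disagreement is measured and how the confidence set is built, which are exactly the parts the paper's analysis specifies.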

Theorems & Definitions (16)

  • Definition 2.1: Latent Markov Decision Process (LMDP)
  • Lemma 3.1: TV Bound via OPE for MDPs
  • Lemma 3.2: Coverage Multiplicative Increase
  • Theorem 3.3
  • Definition 4.1: LMDP Coverage Coefficient
  • Lemma 4.2: TV Bound via OPE for LMDPs
  • Remark 4.3: Why is single latent-state coverability coefficient not enough?
  • Lemma 4.4: Sufficiency of Memoryless Policies for LMDPs
  • Theorem 4.5: Sample Complexity of LMDP-OMLE
  • Lemma 5.1
  • ...and 6 more