Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes

Miao Lu, Yifei Min, Zhaoran Wang, Zhuoran Yang

TL;DR

This work addresses offline reinforcement learning in partially observable environments where latent states confound actions and observations. It introduces P3O, a framework that uses proximal causal inference to identify policy values via confounding bridge functions and employs minimax estimation with pessimism to cope with distributional shift under partial data coverage. Theoretical results establish $n^{-1/2}$-suboptimality for general function classes and $\tilde{O}(\sqrt{H^3 d / n})$ suboptimality under linear function approximation, marking the first provably efficient offline RL method for confounded POMDPs. The approach has potential implications for domains like precision medicine and autonomous systems where offline data are plentiful but latent factors complicate learning.
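To get a feel for the two rates quoted above, the following minimal sketch evaluates them with all constants and logarithmic factors dropped. It assumes the usual reading of $H$ as the horizon and $d$ as the feature dimension, which this summary does not define; the function names are hypothetical and the numbers only illustrate scaling, not actual guarantees.

```python
import math


def general_rate(n: int) -> float:
    """n^{-1/2}: the general-function-class rate, constants dropped."""
    return n ** -0.5


def linear_rate(H: int, d: int, n: int) -> float:
    """sqrt(H^3 * d / n): the linear-case rate, constants and log factors dropped.

    H and d are assumed to be the horizon and the feature dimension.
    """
    return math.sqrt(H ** 3 * d / n)


if __name__ == "__main__":
    H, d = 10, 5  # hypothetical problem sizes
    for n in (1_000, 4_000, 16_000):
        print(n, general_rate(n), linear_rate(H, d, n))
```

Both expressions scale as $n^{-1/2}$, so quadrupling the number of trajectories halves each of them, while doubling $H$ inflates the linear-case expression by a factor of $2^{3/2}$.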

Abstract

We study offline reinforcement learning (RL) in partially observable Markov decision processes. In particular, we aim to learn an optimal policy from a dataset collected by a behavior policy which possibly depends on the latent state. Such a dataset is confounded in the sense that the latent state simultaneously affects the action and the observation, which is prohibitive for existing offline RL algorithms. To this end, we propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm, which addresses the confounding bias and the distributional shift between the optimal and behavior policies in the context of general function approximation. At the core of P3O is a coupled sequence of pessimistic confidence regions constructed via proximal causal inference, which is formulated as minimax estimation. Under a partial coverage assumption on the confounded dataset, we prove that P3O achieves an $n^{-1/2}$-suboptimality, where $n$ is the number of trajectories in the dataset. To the best of our knowledge, P3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
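The pessimism principle mentioned in the abstract can be illustrated with a deliberately simplified sketch: each candidate policy is scored by the worst value consistent with its confidence region, and the policy with the best worst-case score is returned. This is not P3O itself. In the paper the confidence regions are built via proximal causal inference and minimax estimation; here they are replaced by hand-written lists of plausible values, and the policy names and numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def pessimistic_choice(
    confidence_regions: Dict[str, List[float]],
) -> Tuple[str, Dict[str, float]]:
    """Pick the policy whose worst plausible value is largest (max-min).

    confidence_regions maps a policy name to the values that are still
    plausible for it given the data (a stand-in for a real confidence region).
    """
    lower_bounds = {pi: min(vals) for pi, vals in confidence_regions.items()}
    best = max(lower_bounds, key=lower_bounds.get)
    return best, lower_bounds


if __name__ == "__main__":
    regions = {
        # Well covered by the data: narrow region.
        "well_covered": [0.55, 0.60, 0.65],
        # Poorly covered: wide region with an attractive but unreliable top end.
        "poorly_covered": [0.10, 0.70, 1.30],
    }
    print(pessimistic_choice(regions))
```

In this toy, a rule that ranks policies by the middle of each region would prefer the poorly covered policy, whereas the max-min rule prefers the well covered one. That is the sense in which pessimism guards against distributional shift under partial coverage.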

Paper Structure

This paper contains 41 sections, 13 theorems, 202 equations, 4 figures, 2 tables, 1 algorithm.

Key Result

Theorem 3.8

Under the negative control conditional independence assumption and the assumption that bridge functions exist, for any history-dependent policy $\pi\in\Pi(\mathcal{H})$, it holds that

Figures (4)

  • Figure 1: Causal graph of the data-generating process for offline learning in a POMDP. Dotted nodes indicate variables that are not stored in the offline dataset. Here $S_{h}$ is the state of the environment at step $h$, and $A_{h}$, $R_h$, $O_h$ are the action, immediate reward, and observation at the $h$-th step, respectively. These variables are stored in the offline dataset and are therefore represented by black solid circles. Solid arrows indicate the dependencies among the variables. Specifically, the action $A_h$ is chosen by the behavior policy, which is a function of $S_h$; this dependency is depicted by the blue arrows. Moreover, both the observations and the rewards depend on the state $S_h$; this dependency is depicted in red. Note that $S_h$ affects both $A_h$ and $O_h$ and thus serves as an unobserved confounder.
  • Figure 2: Causal graph for reactive policy. The dotted nodes indicate that the variables are not stored in the offline dataset. Solid arrows indicate the dependency among the variables. Specifically, the red arrows depict the dependence of the target policy on the observable variables. The blue arrows depict the dependence of the behavior policy on the latent state. The negative control action and outcome variables at the $h$-th step are filled in green and yellow, respectively.
  • Figure 3: Causal graph for finite-length history policy. Index $l=\max\{1,h-k\}$. The dotted nodes indicate that the variables are not stored in the offline dataset. Solid arrows indicate the dependency among the variables. Specifically, the red arrows depict the dependence of the target policy on the observable variables. The blue arrows depict the dependence of the behavior policy on the latent state. The negative control action and outcome variables at step $h$ are filled in green and yellow, respectively.
  • Figure 4: Causal graph for full-length history policy. The dotted nodes indicate that the variables are not stored in the offline dataset. Solid arrows indicate the dependency among the variables. Specifically, the red arrows depict the dependence of the target policy on the observable variables. The blue arrows depict the dependence of the behavior policy on the latent state. The negative control action and outcome variables at step $h$ are filled in green and yellow, respectively.
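
The confounding described in the Figure 1 caption can be made concrete with a single-step toy computation. The sketch below is not taken from the paper: the latent-state distribution, behavior policy, and reward table are hypothetical numbers chosen so that the behavior policy depends on a latent state that is not logged. It compares what the logged (action, reward) pairs would estimate with the value of actually committing to an action.

```python
# Toy single-step model; every number here is a hypothetical illustration.
P_S = {0: 0.5, 1: 0.5}            # distribution of the latent state S (not logged)
P_A1_GIVEN_S = {0: 0.1, 1: 0.9}   # behavior policy: P(A = 1 | S = s)


def mean_reward(s: int, a: int) -> float:
    """Expected reward; it depends strongly on S and only mildly on A."""
    return 1.0 * s + 0.2 * a


def behavior_prob(a: int, s: int) -> float:
    """P(A = a | S = s) under the behavior policy."""
    p1 = P_A1_GIVEN_S[s]
    return p1 if a == 1 else 1.0 - p1


def naive_value(a: int) -> float:
    """E[R | A = a] in the logged data, i.e. what ignoring S would estimate."""
    joint = {s: P_S[s] * behavior_prob(a, s) for s in P_S}
    norm = sum(joint.values())
    return sum(joint[s] / norm * mean_reward(s, a) for s in P_S)


def true_value(a: int) -> float:
    """E[R | do(A = a)]: value of always playing a, averaged over the latent S."""
    return sum(P_S[s] * mean_reward(s, a) for s in P_S)


if __name__ == "__main__":
    for a in (0, 1):
        print(a, naive_value(a), true_value(a))
```

With these numbers the logged data suggest that action 1 beats action 0 by about 1.0, while the true advantage is only 0.2. The gap arises because the behavior policy plays action 1 mostly when the latent state is already favorable, which is the bias that the negative control variables in Figures 2 to 4 are meant to correct.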

Theorems & Definitions (23)

  • Example 2.1: Reactive policy
  • Example 2.2: Finite-history policy
  • Example 2.3: Full-history policy
  • Example 3.2: Example 2.1 (reactive policy) revisited
  • Example 3.3: Example 2.2 (finite-history policy) revisited
  • Example 3.4: Example 2.3 (full-history policy) revisited
  • Example 3.6: Example 2.1 (reactive policy) revisited
  • Example 3.7: Example 2.2 (finite-history policy) revisited
  • Theorem 3.8: Identification of policy value
  • Theorem 4.4: Suboptimality
  • ...and 13 more