Offline Oracle-Efficient Learning for Contextual MDPs via Layerwise Exploration-Exploitation Tradeoff

Jian Qian, Haichen Hu, David Simchi-Levi

TL;DR

This work establishes a principled reduction from stochastic Contextual MDPs to offline density estimation under realizability, enabling oracle-efficient learning of CMDPs. The proposed LOLIPOP algorithm introduces layerwise exploration-exploitation via a policy cover and trusted occupancy measures, achieving near-optimal regret with only $O(H\log T)$ or $O(H\log\log T)$ offline-density-estimation calls. It generalizes to reward-free reinforcement learning for CMDPs, delivering near-optimal sample complexity with limited oracle interactions. The framework hinges on carefully designed components (IGW policy covers, trusted transitions, and offline-DE guarantees) to bridge CMDPs with offline estimation, offering practical efficiency and broad applicability. Overall, the paper broadens the frontier of offline-oracle-efficient learning for sequential decision problems with contexts, advancing beyond contextual bandits to general CMDPs.

Abstract

Motivated by the recent discovery of a statistical and computational reduction from contextual bandits to offline regression (Simchi-Levi and Xu, 2021), we address the general (stochastic) Contextual Markov Decision Process (CMDP) problem with horizon $H$ (also known as a CMDP with $H$ layers). In this paper, we introduce a reduction from CMDPs to offline density estimation under the realizability assumption, i.e., a model class $\mathcal{M}$ containing the true underlying CMDP is provided in advance. We develop an efficient, statistically near-optimal algorithm requiring only $O(H\log T)$ calls to an offline density estimation algorithm (or oracle) across all $T$ rounds of interaction. This number can be further reduced to $O(H\log\log T)$ if $T$ is known in advance. Our results mark the first efficient and near-optimal reduction from CMDPs to offline density estimation without imposing any structural assumptions on the model class. A notable feature of our algorithm is the design of a layerwise exploration-exploitation tradeoff tailored to address the layerwise structure of CMDPs. Additionally, our algorithm is versatile and applicable to pure exploration tasks in reward-free reinforcement learning.
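To make the notion of an offline density estimation oracle under realizability concrete, here is a minimal sketch of maximum likelihood estimation over a finite model class. The function name `mle_oracle`, the representation of each model as a callable returning a conditional probability, and the toy models are all assumptions made for illustration; the paper's oracle operates on CMDP transition and reward models and is not specified at this level of detail here.

```python
import math

def mle_oracle(model_class, data):
    """Return the model in a finite class with the highest log-likelihood.

    model_class: list of callables model(x, y) -> probability of y given x
                 (hypothetical representation chosen for this sketch).
    data: list of (x, y) pairs collected offline.
    """
    def log_likelihood(model):
        total = 0.0
        for x, y in data:
            p = model(x, y)
            if p <= 0.0:
                # A model that assigns zero mass to observed data is ruled out.
                return -math.inf
            total += math.log(p)
        return total

    return max(model_class, key=log_likelihood)

# Two toy candidate models over binary outcomes (illustrative only).
def copy_model(x, y):
    # Predicts y == x with probability 0.9.
    return 0.9 if y == x else 0.1

def uniform_model(x, y):
    # Ignores the input and predicts each outcome with probability 0.5.
    return 0.5

if __name__ == "__main__":
    samples = [(0, 0), (1, 1), (0, 0), (1, 0)]
    best = mle_oracle([copy_model, uniform_model], samples)
    print(best.__name__)
```

The oracle is called on a fixed batch of data and returns a single model, which is what makes it "offline": the learner decides when to call it, and the number of such calls is the quantity the abstract bounds.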

Paper Structure

This paper contains 30 sections, 15 theorems, 104 equations, 1 figure, 1 table, 1 algorithm.

Key Result

Theorem 1

If $T$ is known, then by choosing the epoch schedule $\tau_m = 2(T/H)^{1-2^{-m}}$ for $m\geq 1$ and the offline density estimation oracle $\mathrm{OffDE}_\mathcal{M} = \mathsf{MLE}_{\mathcal{M}}$, the outputs $\{\pi_t\}_{t\in [T]}$ of Algorithm 1 satisfy, with probability at least $1-\delta$ and with only $O(H\log\log T)$ calls to the $\mathsf{MLE}_{\mathcal{M}}$ oracle, for ...
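The epoch schedule in Theorem 1 can be tabulated directly to see why the number of epochs grows like $\log\log T$. The sketch below computes $\tau_m = 2(T/H)^{1-2^{-m}}$ and stops at the first epoch with $\tau_m \geq T/H$; that stopping rule is an assumption of this sketch, not something stated in the theorem. The doubling schedule $\tau_m = 2^m$ is a hypothetical baseline included only to contrast logarithmic with doubly logarithmic growth, and both function names are illustrative.

```python
def known_horizon_schedule(T, H):
    """Epoch lengths tau_m = 2 * (T/H) ** (1 - 2 ** -m) for m = 1, 2, ...

    Stops at the first m with tau_m >= T/H (an assumption of this sketch).
    """
    n = T / H
    taus = []
    m = 1
    while True:
        tau = 2.0 * n ** (1.0 - 2.0 ** (-m))
        taus.append(tau)
        if tau >= n:
            return taus
        m += 1

def doubling_schedule(T, H):
    """Hypothetical baseline tau_m = 2 ** m, shown for comparison only."""
    n = T / H
    taus = []
    m = 1
    while True:
        tau = float(2 ** m)
        taus.append(tau)
        if tau >= n:
            return taus
        m += 1

if __name__ == "__main__":
    T, H = 10**6, 10
    print("known-horizon epochs:", len(known_horizon_schedule(T, H)))
    print("doubling epochs:", len(doubling_schedule(T, H)))
```

Under these assumptions, the condition $\tau_m \geq T/H$ is equivalent to $(T/H)^{2^{-m}} \leq 2$, that is, $m \geq \log_2\log_2(T/H)$. If the oracle is called once per layer in each epoch (again an assumption), this is consistent with the $O(H\log\log T)$ count in the theorem.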

Figures (1)

  • Figure 1: The dependence graph of the construction. The estimate $\widehat{M}_{m-1}=\{ \widehat{P}_{m-1}^h, \widehat{R}_{m-1}^h \}_{h\in [H]}$ from the previous round provides the optimal policy $\widehat{\pi}_{m-1}$ (optimal-policy step) and the regret estimate for round $m-1$ (policy-covering step), both used to construct $\Pi_{m}^h$ and $p_m^h$. The estimates $\widehat{P}_m^h, \widehat{R}_m^h$ are generated by calling the oracle $\mathrm{OffDE}_\mathcal{M}$ on the trajectories collected with the policy kernel (model-estimation step). The trusted transitions and trusted occupancy measures $\widetilde{\mathcal{T}}_m^h,\widetilde{d}_m^{h+1}$ are computed from $\widetilde{d}_m^{h}, \widehat{P}_m^h$ (see the definitions of trusted transitions and trusted occupancy measures). The policy cover $\Pi_{m}^h$ is the union of $\widehat{\pi}_{m-1}$ and the policies $\{\pi^{h,s,a}_{m,\cdot}\}_{(s,a)\in \mathcal{S}\times\mathcal{A}}$ computed in the policy-covering step, which requires $\widetilde{\mathcal{T}}_m^{h-1},\widetilde{d}_m^{h}$. The policy kernel $p_m^h$ is the inverse gap weighting on $\Pi_{m}^h$ (IGW step).
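The last step in the caption, inverse gap weighting over a policy cover, can be illustrated with the standard contextual-bandit form of the rule (as in Simchi-Levi and Xu, 2021). This is a minimal sketch under assumptions: the function name, the `gamma` exploration parameter, and the use of the cover size as the normalizer are choices made here for illustration, and the paper's exact kernel $p_m^h$ may differ.

```python
def inverse_gap_weighting(estimated_values, gamma):
    """Distribution over candidates that favors small estimated gaps.

    estimated_values: estimated value of each candidate policy in the cover.
    gamma: positive scale; larger values concentrate mass on the greedy choice.
    """
    k = len(estimated_values)
    best = max(range(k), key=lambda i: estimated_values[i])
    probs = [0.0] * k
    for i in range(k):
        if i != best:
            gap = estimated_values[best] - estimated_values[i]
            # Each non-greedy candidate receives at most 1/k probability.
            probs[i] = 1.0 / (k + gamma * gap)
    # The greedy candidate receives the remaining mass, which is at least 1/k.
    probs[best] = 1.0 - sum(probs)
    return probs

if __name__ == "__main__":
    print(inverse_gap_weighting([1.0, 0.5, 0.0], gamma=10.0))
```

Candidates with a larger estimated gap to the greedy choice are sampled less often but never with probability zero, which is how this weighting trades exploration against exploitation within a layer.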

Theorems & Definitions (18)

  • Definition 2.1: Offline density estimation oracle
  • Theorem 1
  • Theorem 2
  • Lemma 1
  • Lemma 2
  • Definition 4.1
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • ...and 8 more