Offline Oracle-Efficient Learning for Contextual MDPs via Layerwise Exploration-Exploitation Tradeoff
Jian Qian, Haichen Hu, David Simchi-Levi
TL;DR
This work establishes a principled reduction from stochastic Contextual MDPs (CMDPs) to offline density estimation under realizability, enabling oracle-efficient learning of CMDPs. The proposed LOLIPOP algorithm introduces a layerwise exploration-exploitation tradeoff via a policy cover and trusted occupancy measures, achieving near-optimal regret with only $O(H\log T)$ calls to an offline density estimation oracle, or $O(H\log\log T)$ calls when $T$ is known in advance. It generalizes to reward-free reinforcement learning for CMDPs, delivering near-optimal sample complexity with limited oracle interactions. The framework hinges on carefully designed components (inverse gap weighting (IGW) policy covers, trusted transitions, and offline density estimation guarantees) to bridge CMDPs with offline estimation. Overall, the paper extends offline-oracle-efficient learning for sequential decision problems with contexts beyond contextual bandits to general CMDPs.
Abstract
Motivated by the recent discovery of a statistical and computational reduction from contextual bandits to offline regression (Simchi-Levi and Xu, 2021), we address the general (stochastic) Contextual Markov Decision Process (CMDP) problem with horizon $H$ (also known as a CMDP with $H$ layers). In this paper, we introduce a reduction from CMDPs to offline density estimation under the realizability assumption, i.e., a model class $M$ containing the true underlying CMDP is provided in advance. We develop an efficient, statistically near-optimal algorithm requiring only $O(H\log T)$ calls to an offline density estimation algorithm (or oracle) across all $T$ rounds of interaction. This number can be further reduced to $O(H\log\log T)$ if $T$ is known in advance. Our results mark the first efficient and near-optimal reduction from CMDPs to offline density estimation without imposing any structural assumptions on the model class. A notable feature of our algorithm is the design of a layerwise exploration-exploitation tradeoff tailored to address the layerwise structure of CMDPs. Additionally, our algorithm is versatile and applicable to pure exploration tasks in reward-free reinforcement learning.
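To give a feel for where the $O(H\log T)$ and $O(H\log\log T)$ oracle-call counts can come from, the following is a minimal sketch of two epoch schedules of the kind used in the contextual bandit reduction that motivates this work. It is an illustration under assumptions, not the paper's algorithm: the function names, the doubling schedule, the schedule $T^{1-2^{-m}}$ for known $T$, and the rule of one oracle call per layer per epoch are all hypothetical choices made here for the example.

```python
import math


def doubling_epoch_ends(T):
    """Epoch end points 2, 4, 8, ..., capped at T.

    Assumed schedule for the case where T is not known in advance:
    the number of epochs grows like log2(T).
    """
    ends, t = [], 2
    while t < T:
        ends.append(t)
        t *= 2
    ends.append(T)
    return ends


def known_horizon_epoch_ends(T):
    """Epoch end points ceil(T^(1 - 2^-m)) for m = 1, 2, ..., capped at T.

    Assumed schedule for the case where T (>= 2) is known in advance:
    the number of epochs grows like log2(log2(T)).
    """
    num_epochs = max(1, math.ceil(math.log2(math.log2(T))))
    ends = [min(T, math.ceil(T ** (1 - 2.0 ** (-m)))) for m in range(1, num_epochs)]
    ends.append(T)
    # Deduplicate and sort so the end points are strictly increasing.
    return sorted(set(ends))


def oracle_calls(epoch_ends, H):
    """Total oracle calls, assuming one call per layer in every epoch."""
    return H * len(epoch_ends)


if __name__ == "__main__":
    T, H = 65536, 5
    print("doubling schedule calls:", oracle_calls(doubling_epoch_ends(T), H))
    print("known-T schedule calls: ", oracle_calls(known_horizon_epoch_ends(T), H))
```

Under these assumptions, the number of epochs is the only quantity that depends on $T$, and the factor $H$ enters because each layer is handled separately within an epoch.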
