Provable Offline Reinforcement Learning for Structured Cyclic MDPs

Kyungbok Lee, Angelica Cristello Sarteau, Michael R. Kosorok

TL;DR

This work introduces a novel cyclic Markov decision process (MDP) framework for multi-step decision problems with heterogeneous stage-specific dynamics, transitions, and discount factors across the cycle, and proposes CycleFQI, an extension of fitted Q-iteration enabling theoretical analysis and interpretation.

Abstract

We introduce a novel cyclic Markov decision process (MDP) framework for multi-step decision problems with heterogeneous stage-specific dynamics, transitions, and discount factors across the cycle. In this setting, offline learning is challenging: optimizing a policy at any stage shifts the state distributions of subsequent stages, propagating mismatch across the cycle. To address this, we propose a modular structural framework that decomposes the cyclic process into stage-wise sub-problems. While generally applicable, we instantiate this principle as CycleFQI, an extension of fitted Q-iteration enabling theoretical analysis and interpretation. It uses a vector of stage-specific Q-functions, tailored to each stage, to capture within-stage sequences and transitions between stages. This modular design enables partial control, allowing some stages to be optimized while others follow predefined policies. We establish finite-sample suboptimality error bounds and derive global convergence rates under Besov regularity, demonstrating that CycleFQI mitigates the curse of dimensionality compared to monolithic baselines. Additionally, we propose a sieve-based method for asymptotic inference of optimal policy values under a margin condition. Experiments on simulated and real-world Type 1 Diabetes data sets demonstrate CycleFQI's effectiveness.
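
The idea of a vector of stage-specific Q-functions with partial control can be illustrated with a small tabular sketch. This is not the paper's CycleFQI algorithm. It is a toy version under stated assumptions: one decision step per stage, finite states and actions, the discount $\gamma_k$ applied when moving from stage $k$ to the next stage, and a per-cell average standing in for the regression step. The function name `cycle_fqi_tabular`, the data layout, and the `behavior` argument are hypothetical.

```python
from collections import defaultdict


def cycle_fqi_tabular(data, gammas, n_actions, update_set, behavior, n_iters=200):
    """Toy cyclic fitted Q-iteration with one Q-table per stage.

    data[k]      : logged (s, a, r, s_next) tuples for stage k; s_next is a
                   state of stage (k + 1) % K (assumed layout).
    gammas[k]    : discount applied when crossing from stage k to the next.
    n_actions[k] : number of actions available at stage k.
    update_set   : stages whose policy is optimized (max backup).
    behavior[k]  : dict state -> action, the fixed policy for stage k when
                   k is not in update_set.
    """
    K = len(data)
    Q = [defaultdict(float) for _ in range(K)]
    for _ in range(n_iters):
        new_Q = []
        for k in range(K):
            nxt = (k + 1) % K
            sums, counts = defaultdict(float), defaultdict(int)
            for s, a, r, s_next in data[k]:
                if nxt in update_set:
                    # Optimized stage: back up the greedy value.
                    v_next = max(Q[nxt][(s_next, b)] for b in range(n_actions[nxt]))
                else:
                    # Fixed stage: back up the value of the predefined action.
                    v_next = Q[nxt][(s_next, behavior[nxt][s_next])]
                sums[(s, a)] += r + gammas[k] * v_next
                counts[(s, a)] += 1
            # "Regression" step: average the targets within each (s, a) cell.
            fitted = defaultdict(float)
            for key, total in sums.items():
                fitted[key] = total / counts[key]
            new_Q.append(fitted)
        Q = new_Q
    return Q


# Two stages, one state each, two actions each; all numbers are made up.
data = [
    [(0, 0, 0.0, 0), (0, 1, 1.0, 0)],  # stage 0: action 1 pays 1
    [(0, 0, 2.0, 0), (0, 1, 0.0, 0)],  # stage 1: action 0 pays 2
]
gammas = [0.5, 0.5]
n_actions = [2, 2]

# Optimize both stages.
Q_full = cycle_fqi_tabular(data, gammas, n_actions, {0, 1}, {})
print(Q_full[0][(0, 1)])  # approx 2.667 (= 8/3)

# Partial control: optimize stage 0 only; stage 1 always plays action 1.
Q_partial = cycle_fqi_tabular(data, gammas, n_actions, {0}, {1: {0: 1}})
print(Q_partial[0][(0, 1)])  # approx 1.333 (= 4/3)
```

In this toy problem the product of the stage discounts is $0.25$ per cycle, so the iteration converges quickly. Fixing stage 1 to a poor action lowers the value attainable at stage 0, which is the dependence across stages that the update set makes explicit.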

Paper Structure

This paper contains 54 sections, 11 theorems, 228 equations, 4 figures, 7 tables, 2 algorithms.

Key Result

Proposition 2

Let $H = \sum_{k=1}^K H_k$ and $\gamma_{\mathrm{cycle}} = \prod_{k=1}^K \gamma_k$. For any update set $\mathcal{U} \subseteq \{1,\dots,K\}$ and vectors $\mathbf{f}, \mathbf{g} \in \prod_{k=1}^K L_\infty(\mathcal{S}_k \times \mathcal{A}_k)$, the operator $\mathbf{T}_{\mathcal{U}}$ defined by Equation

Figures (4)

  • Figure 1: Illustration of a cyclic MDP with $K=3$ stages. Each stage $k$ is an MDP $\mathcal{M}_k$ with $\tau_k$ steps, connected cyclically via transitions $\phi_k$ with discounts $\gamma_k$. We estimate the optimal Q-function $Q_k^*$ for each stage, maximizing expected discounted reward over an infinite loop starting from stage $k$.
  • Figure 2: Diagnostic plots for statistical inference based on $n=2400$ samples, each estimated across 200 independent trials. The left panel presents a Q-Q plot of the empirical squared error statistic $D^2$ against the theoretical $\chi^2_3$ quantiles. The right panel shows a scatter plot of the estimates $(\hat{v}_1, \hat{v}_2)$ in relation to the ground truth $\mathbf{v}^*$ (blue star) and the empirical mean (green star).
  • Figure 3: Box plots of estimated values $\widehat{V}_k(s_k)$ at test sample states for each stage under four update sets in the T1D data set. Colors indicate update sets. For each time period and update set, two box plots are shown for comparison; the algorithm corresponding to each box is indicated in the legend (left: $\mathtt{CycleFQI}$; right: flattened FQI).
  • Figure 4: Bootstrap distribution of estimated values $\widehat{V}_k$ at stage starts in T1D analysis. Red line: average observed test cumulative reward; green line: bootstrap mean; black lines: 90% intervals.
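
The quantities named in the Figure 4 caption (a bootstrap mean and a 90% interval) can be sketched as follows. This excerpt does not say how the paper builds its intervals, so the percentile method used here is an assumption, as are the function name `bootstrap_percentile_interval` and the synthetic input values. The paper's analysis concerns estimated values $\widehat{V}_k$; this sketch only resamples a list of scalars to show the mechanics.

```python
import random
import statistics


def bootstrap_percentile_interval(values, n_boot=2000, level=0.90, seed=0):
    """Return (bootstrap mean, lower, upper) for the sample mean of `values`.

    Uses the plain percentile method: resample with replacement, recompute
    the mean, and read the interval off the sorted bootstrap means.
    """
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    alpha = (1.0 - level) / 2.0
    lower = means[int(alpha * n_boot)]
    upper = means[min(n_boot - 1, int((1.0 - alpha) * n_boot))]
    return statistics.fmean(means), lower, upper


# Synthetic stand-in for per-sample value estimates (not data from the paper).
_rng = random.Random(1)
sample = [_rng.gauss(1.0, 0.5) for _ in range(200)]

boot_mean, lower, upper = bootstrap_percentile_interval(sample)
print(f"bootstrap mean={boot_mean:.3f}, 90% interval=({lower:.3f}, {upper:.3f})")
```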

Theorems & Definitions (16)

  • Remark 1
  • Proposition 2: Contraction Property
  • Remark 3: Role of Distributional Coverage
  • Theorem 4: Suboptimality Bound for $\mathtt{CycleFQI}$
  • Theorem 5: Finite Sample Convergence Rate under Besov Regularity
  • Corollary 6: Finite-Sample Error Comparison
  • Corollary 7: Worst-Case Lower Bound for Q-Function Estimation
  • Remark 8: Utilization of Structural Information
  • Theorem 9: Expected Finite-Sample Rate with Random Forests
  • Theorem 10: Asymptotic Normality
  • ...and 6 more