Provable Offline Reinforcement Learning for Structured Cyclic MDPs

Kyungbok Lee; Angelica Cristello Sarteau; Michael R. Kosorok

TL;DR

This work introduces a novel cyclic Markov decision process (MDP) framework for multi-step decision problems with heterogeneous stage-specific dynamics, transitions, and discount factors across the cycle, and proposes CycleFQI, an extension of fitted Q-iteration enabling theoretical analysis and interpretation.

Abstract

We introduce a novel cyclic Markov decision process (MDP) framework for multi-step decision problems with heterogeneous stage-specific dynamics, transitions, and discount factors across the cycle. In this setting, offline learning is challenging: optimizing a policy at any stage shifts the state distributions of subsequent stages, propagating mismatch across the cycle. To address this, we propose a modular structural framework that decomposes the cyclic process into stage-wise sub-problems. While generally applicable, we instantiate this principle as CycleFQI, an extension of fitted Q-iteration enabling theoretical analysis and interpretation. It uses a vector of stage-specific Q-functions, tailored to each stage, to capture within-stage sequences and transitions between stages. This modular design enables partial control, allowing some stages to be optimized while others follow predefined policies. We establish finite-sample suboptimality error bounds and derive global convergence rates under Besov regularity, demonstrating that CycleFQI mitigates the curse of dimensionality compared to monolithic baselines. Additionally, we propose a sieve-based method for asymptotic inference of optimal policy values under a margin condition. Experiments on simulated and real-world Type 1 Diabetes data sets demonstrate CycleFQI's effectiveness.
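The stage-wise idea described in the abstract can be illustrated with a small tabular sketch. This is not the authors' algorithm. It assumes each stage has a single step, finite state and action sets, a discount applied at each stage transition, and a least-squares fit that reduces to cell means. The names `cycle_fqi_sketch`, `update_set`, and `fixed_policy` are hypothetical. Stages in `update_set` are optimized with a max over actions, while the others follow a predefined policy, mirroring the partial-control setting.

```python
import numpy as np

def cycle_fqi_sketch(data, n_states, n_actions, gammas, update_set,
                     fixed_policy, n_iters=200):
    """Toy stage-wise fitted Q-iteration on a cycle (illustrative assumption).

    data[k] holds rows (s, a, r, s_next) for stage k, where s_next is a
    state of stage (k + 1) % K. One Q-table is kept per stage.
    """
    K = len(data)
    Q = [np.zeros((n_states, n_actions)) for _ in range(K)]
    for _ in range(n_iters):
        new_Q = []
        for k in range(K):
            nxt = (k + 1) % K
            s, a, r, s2 = (data[k][:, j] for j in range(4))
            s, a, s2 = s.astype(int), a.astype(int), s2.astype(int)
            if nxt in update_set:
                # Next stage is optimized: back up the greedy value.
                v_next = Q[nxt][s2].max(axis=1)
            else:
                # Next stage follows its predefined policy.
                v_next = Q[nxt][s2, fixed_policy[nxt][s2]]
            target = r + gammas[k] * v_next
            # Tabular least squares: average the targets in each (s, a) cell.
            sums = np.zeros((n_states, n_actions))
            counts = np.zeros((n_states, n_actions))
            np.add.at(sums, (s, a), target)
            np.add.at(counts, (s, a), 1)
            new_Q.append(np.divide(sums, counts,
                                   out=np.zeros_like(sums), where=counts > 0))
        Q = new_Q
    return Q

# Two stages, one state each, two actions; rewards are deterministic.
data = [np.array([[0, 0, 0.0, 0], [0, 1, 1.0, 0]]),
        np.array([[0, 0, 2.0, 0], [0, 1, 0.0, 0]])]
gammas = [0.5, 0.5]

# Full control: both stages are optimized.
Q_full = cycle_fqi_sketch(data, 1, 2, gammas, {0, 1}, {})

# Partial control: only stage 0 is optimized; stage 1 always plays action 1.
Q_part = cycle_fqi_sketch(data, 1, 2, gammas, {0}, {1: np.array([1])})
```

In this toy problem the fully optimized stage-0 value solves V0 = 1 + 0.5 (2 + 0.5 V0), giving 8/3. Forcing stage 1 onto the zero-reward action lowers it to 4/3, which shows how a fixed stage policy propagates around the cycle.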

Paper Structure (54 sections, 11 theorems, 228 equations, 4 figures, 7 tables, 2 algorithms)

Introduction and Related Work
Problem Setup: Cyclic MDP
Structure of the Cyclic MDP
Action-Value and State-Value Functions
Constrained Optimality and the Bellman Operator
Proposed Method: Cyclic Fitted Q-Iteration (CycleFQI)
Algorithm: Cyclic Fitted Q-Iteration
Finite-Sample Analysis for CycleFQI
High-Probability Bound on Suboptimality Gap
Finite-Sample Rates on Suboptimality Gap in Besov Spaces
Mitigating the Curse of Dimensionality via Decomposition
Expected Finite-Sample Rate with Random Forests
Asymptotic Inference with Sieve Approximations
Sieve-based Estimation Framework
Linear Sieve Approximation
...and 39 more sections

Key Result

Proposition 2

Let $H = \sum_{k=1}^K H_k$ and $\gamma_{\mathrm{cycle}} = \prod_{k=1}^K \gamma_k$. For any update set $\mathcal{U} \subseteq \{1,\dots,K\}$ and vectors $\mathbf{f}, \mathbf{g} \in \prod_{k=1}^K L_\infty(\mathcal{S}_k \times \mathcal{A}_k)$, the operator $\mathbf{T}_{\mathcal{U}}$ defined by Equation
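As a quick arithmetic illustration of the two aggregate quantities defined in the proposition, the sketch below computes $H$ and $\gamma_{\mathrm{cycle}}$ for a hypothetical three-stage cycle. The stage horizons and discount factors are made-up values, not taken from the paper.

```python
import math

# Hypothetical stage horizons H_k and stage discounts gamma_k for K = 3.
H_k = [2, 3, 1]
gamma_k = [0.9, 0.95, 0.99]

H = sum(H_k)                      # total number of steps in one cycle
gamma_cycle = math.prod(gamma_k)  # discount accumulated over one full cycle

print(H, round(gamma_cycle, 5))
```

When every $\gamma_k$ lies in $[0, 1]$, the product $\gamma_{\mathrm{cycle}}$ is no larger than the smallest stage discount, so one full pass around the cycle discounts at least as heavily as any single stage.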

Figures (4)

Figure 1: Illustration of a cyclic MDP with $K=3$ stages. Each stage $k$ is an MDP $\mathcal{M}_k$ with $\tau_k$ steps, connected cyclically via transitions $\phi_k$ with discounts $\gamma_k$. We estimate the optimal Q-function $Q_k^*$ for each stage, maximizing expected discounted reward over an infinite loop starting from stage $k$.
Figure 2: Diagnostic plots for statistical inference based on $n=2400$ samples, each estimated across 200 independent trials. The left panel presents a Q-Q plot of the empirical squared error statistic $D^2$ against the theoretical $\chi^2_3$ quantiles. The right panel shows a scatter plot of the estimates $(\hat{v}_1, \hat{v}_2)$ in relation to the ground truth $\mathbf{v}^*$ (blue star) and the empirical mean (green star).
Figure 3: Box plots of estimated values $\widehat{V}_k(s_k)$ at test sample states for each stage under four update sets in the T1D data set. Colors indicate update sets. For each time period and update set, two box plots are shown for comparison; the algorithm corresponding to each box is indicated in the legend (left: $\mathtt{CycleFQI}$; right: flattened FQI).
Figure 4: Bootstrap distribution of estimated values $\widehat{V}_k$ at stage starts in T1D analysis. Red line: average observed test cumulative reward; green line: bootstrap mean; black lines: 90% intervals.

Theorems & Definitions (16)

Remark 1
Proposition 2: Contraction Property
Remark 3: Role of Distributional Coverage
Theorem 4: Suboptimality Bound for $\mathtt{CycleFQI}$
Theorem 5: Finite Sample Convergence Rate under Besov Regularity
Corollary 6: Finite-Sample Error Comparison
Corollary 7: Worst-Case Lower Bound for Q-Function Estimation
Remark 8: Utilization of Structural Information
Theorem 9: Expected Finite-Sample Rate with Random Forests
Theorem 10: Asymptotic Normality
...and 6 more
