Exploiting Exogenous Structure for Sample-Efficient Reinforcement Learning

Jia Wan; Sean R. Sinclair; Devavrat Shah; Martin J. Wainwright

TL;DR

This paper introduces Exo-MDPs, a Markov decision process class with a state split into exogenous and endogenous components, where exogenous dynamics are action-agnostic and endogenous dynamics are deterministic. It establishes a structural equivalence that places Exo-MDPs on par with discrete MDPs and linear mixture MDPs, enabling policy learning with regret that scales with the exogenous dimension $d$ rather than the potentially large endogenous spaces. In the no-observation setting, the authors derive lower bounds $\Omega(Hd\sqrt{K})$ (time-homogeneous) and $\Omega(H^{3/2}d\sqrt{K})$ (time-inhomogeneous), and provide near-optimal algorithms achieving $\tilde{O}(H^{3/2}d\sqrt{K})$, with an effective-dimension refinement $r$ yielding $\tilde{O}(H^{3/2}r\sqrt{K})$. When exogenous states are observed, a plug-in method attains $\tilde{O}(H^{3/2}\sqrt{dK})$, illustrating substantial gains from exogenous-information access. The paper validates theory with an inventory-control study, demonstrating practical sample efficiency and robustness, and discusses extensions to more general exogenous dynamics and lead-time scenarios. Overall, it shows that exploiting exogenous structure can dramatically decouple sample complexity from endogenous-state and action-space sizes, enabling data-efficient RL in structured MDPs relevant to operations research and related domains.
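The plug-in idea for the observed-exogenous setting can be illustrated with a minimal sketch: estimate the exogenous distribution from observed samples, then plan by backward induction as if the estimate were exact. The toy inventory functions `f` and `g`, the cost parameters, and the helper names `plug_in_estimate` and `plan` below are hypothetical choices made for illustration; they are not taken from the paper.

```python
from collections import Counter


def f(s, a, x, cap):
    """Hypothetical endogenous transition: stock left after ordering a and serving demand x."""
    return max(min(s + a, cap) - x, 0)


def g(s, a, x, cap, price=1.0, order=0.5, hold=0.1):
    """Hypothetical reward: sales revenue minus ordering and holding costs."""
    stock = min(s + a, cap)
    sold = min(stock, x)
    return price * sold - order * a - hold * (stock - sold)


def plug_in_estimate(samples, d):
    """Empirical distribution over the d exogenous states from observed samples."""
    counts = Counter(samples)
    n = len(samples)
    return [counts[x] / n for x in range(d)]


def plan(p_x, H, cap):
    """Backward induction that treats p_x as the true exogenous distribution."""
    V = [[0.0] * (cap + 1) for _ in range(H + 1)]
    policy = [[0] * (cap + 1) for _ in range(H)]
    for h in range(H - 1, -1, -1):
        for s in range(cap + 1):
            best_a, best_q = 0, float("-inf")
            for a in range(cap + 1):
                # The expectation is taken over the exogenous state only.
                q = sum(
                    p * (g(s, a, x, cap) + V[h + 1][f(s, a, x, cap)])
                    for x, p in enumerate(p_x)
                )
                if q > best_q:
                    best_a, best_q = a, q
            policy[h][s] = best_a
            V[h][s] = best_q
    return V, policy


# Usage: observed demands -> estimated distribution -> greedy plan.
p_hat = plug_in_estimate([0, 1, 1, 2], d=3)
V, policy = plan(p_hat, H=3, cap=4)
```

Only the `d`-dimensional vector `p_hat` is learned from data in this sketch; the endogenous state and action spaces enter through the known functions `f` and `g`, which is the intuition behind regret that depends on $d$ rather than on those spaces.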

Abstract

We study Exo-MDPs, a structured class of Markov Decision Processes (MDPs) where the state space is partitioned into exogenous and endogenous components. Exogenous states evolve stochastically, independent of the agent's actions, while endogenous states evolve deterministically based on both state components and actions. Exo-MDPs are useful for applications including inventory control, portfolio management, and ride-sharing. Our first result is structural, establishing a representational equivalence between the classes of discrete MDPs, Exo-MDPs, and discrete linear mixture MDPs. Specifically, any discrete MDP can be represented as an Exo-MDP, and the transition and reward dynamics can be written as linear functions of the exogenous state distribution, showing that Exo-MDPs are instances of linear mixture MDPs. For unobserved exogenous states, we prove a regret upper bound of $O(H^{3/2}d\sqrt{K})$ over $K$ trajectories of horizon $H$, with $d$ as the size of the exogenous state space, and establish nearly-matching lower bounds. Our findings demonstrate how Exo-MDPs decouple sample complexity from action and endogenous state sizes, and we validate our theoretical insights with experiments on inventory control.

Paper Structure (35 sections, 16 theorems, 72 equations, 3 figures, 4 tables, 2 algorithms)

Introduction
Contributions
Organization
Related work
Background and problem formulation
Exo-MDP: Markov Decision Processes with Exogenous States
Observations and Performance Objective
Results on Structural Equivalence
Exo-MDP: Sample Efficient Algorithm, Matching Lower Bound
Lower Bound on Regret
Sample efficient algorithm
Infection Model with Vaccines: Impact of Effective Dimension
Full Observation of the Exogenous States
Plug-In Method for the Full Observation and IID Regime
Extension to General Dynamics of the Exogenous States
...and 20 more sections

Key Result

Theorem 1

The classes of Exo-MDPs, discrete MDPs, and discrete linear mixture MDPs are equivalent. More specifically,

Figures (3)

Figure 1: Directed graphical models representing a generic MDP (left) and an Exo-MDP (middle). In a generic MDP, the state space is fully endogenous: the current state $S_h$ and action $A_h$ determine the distribution of the next state $S_{h+1}$ and reward $R_{h}$. In an Exo-MDP, the state is partitioned into an endogenous component $S_h$ and an exogenous component $X_h$. The exogenous state $X_h$ is drawn i.i.d. from the distribution $\mathbb{P}_{x}$, independently of $(S_h, A_h)$. The known deterministic functions $\mathbf{f}$, $\mathbf{g}$ are such that $S_{h+1} = \mathbf{f}(S_h, A_h, X_h)$ and $R_h = \mathbf{g}(S_h, A_h, X_h)$. The right panel gives the structural equivalence relations between the classes of Exo-MDPs, discrete MDPs, and discrete linear mixture MDPs.
Figure 2: In both panels, the $x$-axis shows the episode $k \in [1000]$ and the $y$-axis shows the total cost $C^{\pi^k}$ under the different algorithms. One panel shows results for Scenario I and the other for Scenario II. The grey line corresponds to the performance (total cost) of the optimal base-stock policy, and the black line to the performance of the optimal policy. We note that under Scenario II there is no optimality gap.
Figure 3: Here we plot the total cost function $C_1^b(s_1)$ as we vary the base-stock value $b$ in Scenario II. The $x$-axis denotes the base-stock value $b \in [0,10]$ and the $y$-axis $C_1^b(s_1)$.
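
The relations in the Figure 1 caption, $S_{h+1} = \mathbf{f}(S_h, A_h, X_h)$ and $R_h = \mathbf{g}(S_h, A_h, X_h)$ with $X_h \sim \mathbb{P}_x$, imply that transition probabilities and expected rewards are linear in the exogenous distribution. A minimal sketch of this, assuming a hypothetical toy inventory model (the functions `f` and `g`, the capacity, and the numbers are illustrative and not from the paper):

```python
def f(s, a, x, cap=3):
    """Hypothetical endogenous transition: stock after ordering a and serving demand x."""
    return max(min(s + a, cap) - x, 0)


def g(s, a, x, cap=3):
    """Hypothetical reward: units sold minus an ordering cost of 0.5 per unit."""
    return float(min(min(s + a, cap), x)) - 0.5 * a


def transition_features(s, a, s_next, d):
    """Feature vector in {0,1}^d: entry x is 1 iff f(s, a, x) == s_next."""
    return [1.0 if f(s, a, x) == s_next else 0.0 for x in range(d)]


def transition_prob(s, a, s_next, p_x):
    """P(s_next | s, a) as an inner product of the features with p_x."""
    phi = transition_features(s, a, s_next, len(p_x))
    return sum(w * p for w, p in zip(phi, p_x))


def expected_reward(s, a, p_x):
    """E[g(s, a, X)] as a linear function of p_x."""
    return sum(p * g(s, a, x) for x, p in enumerate(p_x))


# Usage: with stock 1 and order 1, demand x in {0, 1, 2} leaves stock 2, 1, 0.
p_x = [0.2, 0.5, 0.3]
probs = [transition_prob(1, 1, s_next, p_x) for s_next in range(4)]
```

In this sketch the unknown parameter is the vector `p_x`, while the features are computed from the known `f` and `g`; this is the sense in which an Exo-MDP can be read as a linear mixture MDP of dimension $d$.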

Theorems & Definitions (25)

Definition 1
Theorem 1
Lemma 1
proof
Lemma 2
Theorem 2
Theorem 3
proof
Theorem 4
Corollary 1
...and 15 more
