Learning in Markov Decision Processes with Exogenous Dynamics

Davide Maran; Davide Salaorni; Marcello Restelli

TL;DR

This work studies a structured class of MDPs with exogenous state components whose transitions are independent of the agent's actions. It shows that only the size of the exogenous state space appears in the leading terms of the regret bounds, and establishes a matching lower bound showing that this dependence is information-theoretically optimal.

Abstract

Reinforcement learning algorithms are typically designed for generic Markov Decision Processes (MDPs), where any state-action pair can lead to an arbitrary transition distribution. In many practical systems, however, only a subset of the state variables is directly influenced by the agent's actions, while the remaining components evolve according to exogenous dynamics and account for most of the stochasticity. In this work, we study a structured class of MDPs characterized by exogenous state components whose transitions are independent of the agent's actions. We show that exploiting this structure yields significantly improved learning guarantees, with only the size of the exogenous state space appearing in the leading terms of the regret bounds. We further establish a matching lower bound, showing that this dependence is information-theoretically optimal. Finally, we empirically validate our approach across classical toy settings and real-world-inspired environments, demonstrating substantial gains in sample efficiency compared to standard reinforcement learning methods.
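To make the structural assumption concrete, the following is a minimal sketch and not the paper's ExAVI or ExAQ algorithm. It assumes a tabular state split into a controllable part `s` and an exogenous part `x`, with the exogenous part drawn from a kernel `p_exo[x, x']` that ignores the action. The function names, the deterministic controllable update `(s + a) % n_ctrl`, and the uniform random behavior policy are hypothetical choices made only for illustration. Because the exogenous kernel does not depend on `s` or `a`, transition counts can be pooled over all of them, so the number of quantities to estimate scales with the exogenous state space alone.

```python
import numpy as np


def simulate(p_exo, n_ctrl, n_actions, n_steps, rng):
    """Roll out a toy MDP whose exogenous component ignores the action.

    Hypothetical dynamics: the controllable part moves deterministically,
    s' = (s + a) % n_ctrl, while the exogenous part follows p_exo[x].
    """
    n_exo = p_exo.shape[0]
    s, x = 0, 0
    transitions = []
    for _ in range(n_steps):
        a = int(rng.integers(n_actions))           # uniform random behavior policy
        s_next = (s + a) % n_ctrl                  # action-dependent, deterministic
        x_next = int(rng.choice(n_exo, p=p_exo[x]))  # action-independent, stochastic
        transitions.append((s, x, a, s_next, x_next))
        s, x = s_next, x_next
    return transitions


def estimate_exogenous_kernel(transitions, n_exo):
    """Estimate p_exo by pooling counts over every controllable state and action."""
    counts = np.zeros((n_exo, n_exo))
    for _s, x, _a, _s_next, x_next in transitions:
        counts[x, x_next] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows never visited fall back to a uniform distribution.
    return np.where(row_sums > 0, counts / np.maximum(row_sums, 1), 1.0 / n_exo)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_kernel = np.array([[0.9, 0.1], [0.3, 0.7]])
    data = simulate(true_kernel, n_ctrl=5, n_actions=3, n_steps=20000, rng=rng)
    print(estimate_exogenous_kernel(data, n_exo=2))
```

In this sketch the estimator fills a table of size `n_exo * n_exo`, whereas an unstructured model of the same toy problem would need one distribution per `(s, x, a)` triple.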

Paper Structure (43 sections, 15 theorems, 89 equations, 6 figures, 7 tables, 2 algorithms)

Introduction
Framework Formulation
The PCMDP Framework
Trading.
Reservoir Management.
Algorithms
Model-based Approach: ExAVI
Algorithm Structure.
Theoretical Guarantees.
Model-free Approach: ExAQ
Algorithm Structure.
Theoretical Guarantees.
Lower Bound.
Experiments
Taxi with Traffic Environment.
...and 28 more sections

Key Result

Theorem 2

Under the PCMDP definition (Definition 1), with probability at least $1-\delta$, the regret of the ExAVI algorithm satisfies:

Figures (6)

Figure 1: Comparative learning curves for the TaxiEnv, averaged over $10$ training seeds with $95\%$ confidence intervals. Figure (a): Model-based algorithms (ExAVI vs UCBVI). Figure (b): Model-free algorithms (ExAQ vs QL).
Figure 2: Comparative learning curves of model-free algorithms for the TradingEnv, averaged over $10$ training seeds with $95\%$ confidence intervals. Figure (a): Performance on a linear x-axis, highlighting asymptotic convergence. Figure (b): The same experiment plotted on a logarithmic x-axis, highlighting the dramatic sample efficiency gap in the early stages of training ($10^0$ to $10^2$ episodes).
Figure 3: The uncontrollable part of the hard MDP family.
Figure 4: Comparative learning curves for the tiny version of ElevatorEnv, averaged over $10$ random seeds. Shaded regions denote $95\%$ confidence intervals. Figure (a): Model-based performance. Figure (b): Model-free performance.
Figure 5: Inventory liquidation profiles for the optimal execution problem. ExAQ (blue) discovers a balanced strategy between the passive TWAP (pink) and the aggressive dumping of PPO/QL.
...and 1 more figures

Theorems & Definitions (26)

Definition 1
Theorem 2
Theorem 3
Theorem 4
Theorem 5: Bernstein's inequality (Boucheron et al., 2003)
Lemma 1: Lemma 4.1 in Jin et al. (2018)
Theorem 6
proof
Proposition 7
proof
...and 16 more
