Learning in Markov Decision Processes with Exogenous Dynamics

Davide Maran, Davide Salaorni, Marcello Restelli

TL;DR

This work studies a structured class of MDPs with exogenous state components whose transitions are independent of the agent's actions. It shows that only the size of the exogenous state space appears in the leading terms of the regret bounds, and establishes a matching lower bound showing that this dependence is information-theoretically optimal.

Abstract

Reinforcement learning algorithms are typically designed for generic Markov Decision Processes (MDPs), where any state-action pair can lead to an arbitrary transition distribution. In many practical systems, however, only a subset of the state variables is directly influenced by the agent's actions, while the remaining components evolve according to exogenous dynamics and account for most of the stochasticity. In this work, we study a structured class of MDPs characterized by exogenous state components whose transitions are independent of the agent's actions. We show that exploiting this structure yields significantly improved learning guarantees, with only the size of the exogenous state space appearing in the leading terms of the regret bounds. We further establish a matching lower bound, showing that this dependence is information-theoretically optimal. Finally, we empirically validate our approach across classical toy settings and real-world-inspired environments, demonstrating substantial gains in sample efficiency compared to standard reinforcement learning methods.
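
To make the structural assumption concrete, the following is a minimal sketch, not the paper's ExAVI or ExAQ algorithm. It assumes a toy state `(x, e)` in which the exogenous component `e` follows a Markov chain that ignores both the action and the controllable component `x`. It then compares two count-based estimates of the exogenous kernel from the same data: one pooled over all `(x, a)` pairs, and one that treats every `(x, e, a)` triple separately, as a generic tabular model would. All function names, sizes, and the toy controllable dynamics are hypothetical choices made for illustration.

```python
import numpy as np


def sample_exogenous_kernel(n_exo, rng):
    """Draw a random row-stochastic matrix P[e, e'] (hypothetical ground truth)."""
    return rng.dirichlet(np.ones(n_exo), size=n_exo)


def collect_transitions(p_exo, n_ctrl, n_actions, n_steps, rng):
    """Roll out a uniformly random policy and record (x, e, a, e_next) tuples."""
    n_exo = p_exo.shape[0]
    x, e = 0, 0
    data = []
    for _ in range(n_steps):
        a = int(rng.integers(n_actions))
        # The exogenous part depends only on e, never on x or a.
        e_next = int(rng.choice(n_exo, p=p_exo[e]))
        data.append((x, e, a, e_next))
        # Toy controllable dynamics, assumed here for illustration only.
        x = (x + a) % n_ctrl
        e = e_next
    return data


def normalize(counts):
    """Turn counts into probabilities; unvisited rows fall back to uniform."""
    totals = counts.sum(axis=-1, keepdims=True)
    uniform = np.full_like(counts, 1.0 / counts.shape[-1])
    return np.divide(counts, totals, out=uniform, where=totals > 0)


def estimate_pooled(data, n_exo):
    """Exploit the structure: one row of counts per exogenous state."""
    counts = np.zeros((n_exo, n_exo))
    for _, e, _, e_next in data:
        counts[e, e_next] += 1
    return normalize(counts)


def estimate_generic(data, n_ctrl, n_exo, n_actions):
    """Ignore the structure: one row of counts per (x, e, a) triple."""
    counts = np.zeros((n_ctrl, n_exo, n_actions, n_exo))
    for x, e, a, e_next in data:
        counts[x, e, a, e_next] += 1
    return normalize(counts)


def run_demo(seed=0, n_ctrl=5, n_exo=4, n_actions=3, n_steps=2000):
    """Return the mean L1 error of the pooled and the generic estimates."""
    rng = np.random.default_rng(seed)
    p_exo = sample_exogenous_kernel(n_exo, rng)
    data = collect_transitions(p_exo, n_ctrl, n_actions, n_steps, rng)
    pooled = estimate_pooled(data, n_exo)
    generic = estimate_generic(data, n_ctrl, n_exo, n_actions)
    err_pooled = np.abs(pooled - p_exo).sum(axis=-1).mean()
    err_generic = np.abs(generic - p_exo[None, :, None, :]).sum(axis=-1).mean()
    return err_pooled, err_generic


if __name__ == "__main__":
    err_pooled, err_generic = run_demo()
    print(f"pooled estimate, mean L1 error:  {err_pooled:.3f}")
    print(f"generic estimate, mean L1 error: {err_generic:.3f}")
```

In this sketch the pooled estimator spreads its samples over `n_exo` rows, whereas the generic one spreads the same samples over `n_ctrl * n_exo * n_actions` rows. This is one informal way to see why the size of the exogenous state space alone could govern the statistical difficulty.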

Paper Structure (43 sections, 15 theorems, 89 equations, 6 figures, 7 tables, 2 algorithms)

Key Result

Theorem 2

Under Definition def:pcmdp, with probability at least $1-\delta$, the regret of ExAVI (Algorithm alg:exavi) satisfies:

Figures (6)

  • Figure 1: Comparative learning curves for the TaxiEnv, averaged over $10$ training seeds with $95\%$ confidence intervals. Figure (a): Model-based algorithms (ExAVI vs UCBVI). Figure (b): Model-free algorithms (ExAQ vs QL).
  • Figure 2: Comparative learning curves of model-free algorithms for the TradingEnv, averaged over $10$ training seeds with $95\%$ confidence intervals. Figure (a): Performance on a linear x-axis, highlighting asymptotic convergence. Figure (b): The same experiment plotted on a logarithmic x-axis, highlighting the dramatic sample efficiency gap in the early stages of training ($10^0$ to $10^2$ episodes).
  • Figure 3: The uncontrollable part of the hard MDP family.
  • Figure 4: Comparative learning curves for the tiny version of ElevatorEnv, averaged over $10$ random seeds. Shaded regions denote $95\%$ confidence intervals. Figure (a): Model-based performance. Figure (b): Model-free performance.
  • Figure 5: Inventory liquidation profiles for the optimal execution problem. ExAQ (blue) discovers a balanced strategy between the passive TWAP (pink) and the aggressive dumping of PPO/QL.
  • ...and 1 more figure

Theorems & Definitions (26)

  • Definition 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5: Bernstein's inequality (Boucheron et al., 2003)
  • Lemma 1: Lemma 4.1 in Jin et al. (2018)
  • Theorem 6
  • proof
  • Proposition 7
  • proof
  • ...and 16 more