Off-Policy Temporal Difference Learning for Perturbed Markov Decision Processes: Theoretical Insights and Extensive Simulations

Ali Forootani, Raffaele Iervolino, Massimo Tipaldi, Mohammad Khosravi

TL;DR

This work tackles off-policy temporal-difference learning for perturbed Markov decision processes by projecting the value function into a feature subspace and explicitly modeling perturbations of the transition operator with a perturbed matrix $\bar{P}$. It develops a contraction-preserving projection-based ADP framework and proves convergence of the off-policy TD updates under positive definiteness of key matrices, deriving a bound on the deviation between the optimal cost $J^*$ and the perturbed cost $\bar{J}$. The bound shows that $\|J^* - \bar{J}\|_\infty \le \alpha \|R\|_\infty / (1 - \alpha)$ when $\|P^* - \bar{P}\|_\infty \le 1 - \alpha$. The framework is validated through extensive simulations on a resource-allocation MDP, and a MATLAB package is provided on GitHub, offering a scalable approach for robust policy optimization under controlled non-stationarity in large MDPs.
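
The deviation bound can be sanity-checked numerically on a toy problem. The sketch below is illustrative only and is not taken from the paper or its MATLAB package: it assumes a fixed policy, reads $R$ as the stage-cost vector, and uses a hypothetical 3-state chain with a small hand-made perturbation. All function and variable names are invented for this example.

```python
def policy_cost(P, R, alpha, sweeps=500):
    """Fixed-policy discounted cost J = R + alpha * P J, by repeated substitution."""
    n = len(R)
    J = [0.0] * n
    for _ in range(sweeps):
        J = [R[i] + alpha * sum(P[i][j] * J[j] for j in range(n)) for i in range(n)]
    return J


def matrix_inf_norm(A):
    """Induced infinity norm: the largest absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)


def check_bound(P_star, P_bar, R, alpha):
    """Return (deviation, bound, perturbation size) for a nominal/perturbed pair."""
    J_star = policy_cost(P_star, R, alpha)
    J_bar = policy_cost(P_bar, R, alpha)
    deviation = max(abs(a - b) for a, b in zip(J_star, J_bar))
    bound = alpha * max(abs(x) for x in R) / (1.0 - alpha)
    delta = [[p - q for p, q in zip(rp, rq)] for rp, rq in zip(P_star, P_bar)]
    return deviation, bound, matrix_inf_norm(delta)


# Hypothetical data: a 3-state chain and a perturbed copy (each row still sums to 1).
alpha = 0.8
R = [1.0, 0.0, 2.0]
P_star = [[0.5, 0.5, 0.0], [0.2, 0.5, 0.3], [0.0, 0.4, 0.6]]
P_bar = [[0.55, 0.45, 0.0], [0.2, 0.45, 0.35], [0.0, 0.45, 0.55]]

deviation, bound, delta_norm = check_bound(P_star, P_bar, R, alpha)
print("perturbation within 1 - alpha:", delta_norm <= 1.0 - alpha)
print("deviation within bound:", deviation <= bound)
```

Here the perturbation has infinity norm of about 0.1, which satisfies the condition for $\alpha = 0.8$, so the computed deviation should stay below $\alpha \|R\|_\infty / (1 - \alpha)$.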

Abstract

Dynamic Programming suffers from the curse of dimensionality due to large state and action spaces, a challenge further compounded by uncertainties in the environment. To mitigate these issues, we explore an off-policy Temporal Difference Approximate Dynamic Programming approach that preserves the contraction mapping when projecting the problem into a subspace of selected features, accounting for the probability distribution of the perturbed transition probability matrix. We further demonstrate how this Approximate Dynamic Programming approach can be implemented as a particular variant of the Temporal Difference learning algorithm, adapted for handling perturbations. To validate our theoretical findings, we provide a numerical example using a Markov Decision Process corresponding to a resource allocation problem.
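
To make the idea of off-policy TD with a feature subspace concrete, here is a minimal sketch of a generic importance-weighted TD(0) update with linear features. It is not the algorithm from the paper: the perturbation model is omitted, and the chains `P` (target) and `Q` (behavior), the costs `g`, the features `phi`, and the step-size schedule are all hypothetical choices made for illustration.

```python
import random


def sample_next(row, rng):
    """Draw an index from a probability row by inverse-CDF sampling."""
    u, acc = rng.random(), 0.0
    for j, p in enumerate(row):
        acc += p
        if u < acc:
            return j
    return len(row) - 1  # guard against round-off in the row sum


def off_policy_td0(P, Q, g, phi, alpha, steps=200_000, seed=0):
    """Estimate r with phi[i] . r ~ J(i) for target chain P using samples from Q."""
    rng = random.Random(seed)
    r = [0.0] * len(phi[0])
    i = 0
    for k in range(steps):
        j = sample_next(Q[i], rng)         # transition generated by the behavior chain
        rho = P[i][j] / Q[i][j]            # importance ratio toward the target chain
        v_i = sum(a * b for a, b in zip(phi[i], r))
        v_j = sum(a * b for a, b in zip(phi[j], r))
        delta = g[i] + alpha * v_j - v_i   # temporal difference with stage cost g[i]
        step = 10.0 / (k + 100.0)          # diminishing step size (assumed schedule)
        r = [w + step * rho * delta * f for w, f in zip(r, phi[i])]
        i = j
    return r


# Hypothetical 3-state example: uniform behavior chain, two features per state.
alpha = 0.9
g = [1.0, 0.0, 2.0]
P = [[0.5, 0.5, 0.0], [0.2, 0.5, 0.3], [0.0, 0.4, 0.6]]
Q = [[1.0 / 3.0] * 3 for _ in range(3)]
phi = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]

r = off_policy_td0(P, Q, g, phi, alpha)
print("estimated parameter vector:", r)
```

For this small example the projected equation under uniform state weighting can be solved by hand, giving a parameter vector of roughly $(9.60, 0.99)$, and the iterates should settle near that point as the step size decays. In general, off-policy TD with function approximation is not guaranteed to converge, which is why conditions such as the positive definiteness of the relevant matrices matter.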

Paper Structure

This paper contains 7 sections, 4 theorems, 26 equations, 3 figures, 1 table, 1 algorithm.

Key Result

Lemma 1

For any stochastic matrix $\mathcal{P}$ corresponding to an irreducible and regular Markov chain with associated stationary probability distribution $\epsilon$, whose elements are arranged along the diagonal of a diagonal matrix $\Theta$, the matrix $\Theta (I - \alpha \mathcal{P})$ is positive definite.
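
The lemma can be checked numerically on a small chain. The sketch below is an illustration under stated assumptions, not code from the paper: it assumes $0 < \alpha < 1$, interprets positive definiteness of the (generally non-symmetric) matrix $M = \Theta (I - \alpha \mathcal{P})$ as $x^\top M x > 0$ for all $x \neq 0$, and tests this through a Cholesky factorization of the symmetric part $(M + M^\top)/2$. The chain and all function names are hypothetical.

```python
import math


def stationary_distribution(P, sweeps=2000):
    """Power iteration for eps = eps P, starting from the uniform distribution."""
    n = len(P)
    eps = [1.0 / n] * n
    for _ in range(sweeps):
        eps = [sum(eps[i] * P[i][j] for i in range(n)) for j in range(n)]
    return eps


def is_positive_definite(S, tol=1e-12):
    """Try a Cholesky factorization of the symmetric matrix S; fail on a bad pivot."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= tol:
                    return False
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True


def lemma1_holds(P, alpha):
    """Check x^T Theta (I - alpha P) x > 0 via the symmetric part of the matrix."""
    n = len(P)
    eps = stationary_distribution(P)
    M = [[eps[i] * ((1.0 if i == j else 0.0) - alpha * P[i][j]) for j in range(n)]
         for i in range(n)]
    S = [[0.5 * (M[i][j] + M[j][i]) for j in range(n)] for i in range(n)]
    return is_positive_definite(S)


# Hypothetical irreducible, aperiodic 3-state chain.
P = [[0.5, 0.5, 0.0], [0.2, 0.5, 0.3], [0.0, 0.4, 0.6]]
print("Theta (I - alpha P) positive definite:", lemma1_holds(P, alpha=0.9))
```

Working with the symmetric part is a design choice: $x^\top M x = x^\top \tfrac{1}{2}(M + M^\top) x$ for every $x$, so the quadratic form is positive exactly when the symmetric part admits a Cholesky factor.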

Figures (3)

  • Figure 1: The graph shows the MDP state space and the corresponding state transition probabilities for a resource allocation problem with $N=2$ and $m=2$, and for the control input $u(k)=c_1$. The system states are $(x_1,x_2)\in \{(0,0),(0,1),(1,0),(1,1),(0,2),(2,0)\}$, where $x_1$ and $x_2$ are associated with $c_1$ and $c_2$, respectively.
  • Figure 2: Evolution of the parameter vector $r$ under the off-policy TD approach for the resource allocation problem with $m=4$ prices, $N=20$ resources, and $|\mathcal{X}|=10626$.
  • Figure 3: Effect of different target policies on the parameter vector $\bar{r}$ under the off-policy TD approach.

Theorems & Definitions (12)

  • Lemma 1
  • proof
  • Remark 1
  • Lemma 2
  • proof
  • Theorem 3
  • proof
  • Remark 2
  • Theorem 4
  • proof
  • ...and 2 more