Off-Policy Temporal Difference Learning for Perturbed Markov Decision Processes: Theoretical Insights and Extensive Simulations
Ali Forootani, Raffaele Iervolino, Massimo Tipaldi, Mohammad Khosravi
TL;DR
This work tackles off-policy temporal-difference learning for perturbed Markov decision processes by projecting the value function onto a feature subspace and explicitly modeling perturbations of the transition operator through a perturbed matrix $\bar{P}$. It develops a contraction-preserving, projection-based ADP framework, proves convergence of the off-policy TD updates under positive definiteness of key matrices, and derives a bound on the deviation between the optimal cost $J^*$ and the perturbed cost $\bar{J}$: $\|J^* - \bar{J}\|_\infty \le \alpha \|R\|_\infty / (1 - \alpha)$ whenever $\|P^* - \bar{P}\|_\infty \le 1 - \alpha$. The framework is validated through extensive simulations on a resource-allocation MDP, and a MATLAB package is provided on GitHub, offering a scalable approach to robust policy optimization under controlled non-stationarity in large MDPs.
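The bound stated above can be checked numerically on a small example. The following is a minimal sketch, not the authors' MATLAB package: it assumes a fixed policy (so that each cost solves J = R + alpha P J), a randomly generated 6-state chain, and a perturbation built by mixing P^* with another stochastic matrix. The helper names `policy_cost` and `random_stochastic` are hypothetical.

```python
import numpy as np

def policy_cost(P, R, alpha):
    """Solve J = R + alpha * P @ J for a fixed policy (assumed setting)."""
    return np.linalg.solve(np.eye(P.shape[0]) - alpha * P, R)

def random_stochastic(n, rng):
    """Random strictly positive row-stochastic matrix (illustrative helper)."""
    M = rng.random((n, n)) + 0.01
    return M / M.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, alpha = 6, 0.9                      # assumed problem size and discount factor
P_star = random_stochastic(n, rng)     # nominal transition matrix
R = rng.random(n)                      # per-state cost vector

# Mixing keeps rows summing to 1, and
# ||P_star - P_bar||_inf = eps * ||P_star - Q||_inf <= 2 * eps = 1 - alpha.
eps = (1 - alpha) / 2
P_bar = (1 - eps) * P_star + eps * random_stochastic(n, rng)

# For matrices np.inf gives the maximum absolute row sum; for vectors, max |x_i|.
delta = np.linalg.norm(P_star - P_bar, np.inf)
gap = np.linalg.norm(policy_cost(P_star, R, alpha) - policy_cost(P_bar, R, alpha), np.inf)
bound = alpha * np.linalg.norm(R, np.inf) / (1 - alpha)

print(f"perturbation size: {delta:.4f} (limit {1 - alpha:.4f})")
print(f"cost gap: {gap:.4f}, bound: {bound:.4f}")
```

In this fixed-policy setting the inequality follows from J^* - Jbar = alpha P^* (J^* - Jbar) + alpha (P^* - Pbar) Jbar together with ||Jbar||inf <= ||R||inf / (1 - alpha), so the printed gap should never exceed the bound.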
Abstract
Dynamic Programming suffers from the curse of dimensionality due to large state and action spaces, a challenge further compounded by uncertainties in the environment. To mitigate these issues, we explore an off-policy, Temporal Difference-based Approximate Dynamic Programming approach that preserves the contraction mapping when projecting the problem onto a subspace of selected features, accounting for the probability distribution of the perturbed transition probability matrix. We further demonstrate how this Approximate Dynamic Programming approach can be implemented as a particular variant of the Temporal Difference learning algorithm, adapted for handling perturbations. To validate our theoretical findings, we provide a numerical example using a Markov Decision Process corresponding to a resource allocation problem.
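To make the projection step concrete, the sketch below solves a projected Bellman equation Phi r = Pi (R + alpha P Phi r) in closed form for a fixed policy. It is an illustration under stated assumptions rather than the paper's algorithm: the projection weights are taken to be the stationary distribution of the perturbed matrix, the two features (a constant and the state index) are arbitrary, and the function names `stationary_distribution` and `projected_cost` are hypothetical.

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for the eigenvalue closest to 1, normalized to sum to 1."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

def projected_cost(P, R, Phi, xi, alpha):
    """Solve C r = d with C = Phi' Xi (I - alpha P) Phi and d = Phi' Xi R."""
    Xi = np.diag(xi)
    C = Phi.T @ Xi @ (np.eye(len(R)) - alpha * P) @ Phi
    d = Phi.T @ Xi @ R
    return np.linalg.solve(C, d)   # assumes C is nonsingular

rng = np.random.default_rng(0)
n, alpha = 6, 0.9                                  # assumed size and discount factor

def random_stochastic(n, rng):
    """Random strictly positive row-stochastic matrix (illustrative helper)."""
    M = rng.random((n, n)) + 0.01
    return M / M.sum(axis=1, keepdims=True)

P_star = random_stochastic(n, rng)                 # nominal chain
P_bar = 0.95 * P_star + 0.05 * random_stochastic(n, rng)  # perturbed chain
R = rng.random(n)

xi = stationary_distribution(P_bar)                # projection weights (assumption)
Phi = np.column_stack([np.ones(n), np.arange(n)])  # two illustrative features

r_hat = projected_cost(P_star, R, Phi, xi, alpha)
J_exact = np.linalg.solve(np.eye(n) - alpha * P_star, R)
print("approximate cost:", Phi @ r_hat)
print("exact cost:      ", J_exact)
```

With a full feature basis (Phi equal to the identity) the projected solution coincides with the exact cost, which gives a quick sanity check; with fewer features the quality of the approximation depends on how well the feature subspace captures the cost vector.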
