Non-stationary and Varying-discounting Markov Decision Processes for Reinforcement Learning
Zhizuo Chen, Theodore T. Allen
TL;DR
The paper introduces the Non-stationary and Varying-discounting MDP (NVMDP), a generalization of the non-stationary MDP (NSMDP) in which discount rates may vary with time and transitions, to address non-stationary environments and finite-horizon tasks. It develops the theoretical foundations, including value functions, matrix representations, optimality results, and policy-improvement theorems, and extends dynamic programming, Q-learning, and generalized Q-learning to NVMDPs with convergence guarantees. It further adapts the Policy Gradient Theorem and the TRPO policy improvement bound to NVMDPs, and shows in a non-stationary Tricky Gridworld that NVMDP-based methods recover optimal trajectories where standard Q-learning fails, while also allowing optimal policies to be shaped explicitly through discounting. NVMDPs thus cover both classic infinite-horizon and finite-horizon MDPs within a single framework and need only minor algorithmic changes to handle non-stationarity and shape trajectories.
Abstract
Algorithms developed under stationary Markov Decision Processes (MDPs) often face challenges in non-stationary environments, and infinite-horizon formulations may not directly apply to finite-horizon tasks. To address these limitations, we introduce the Non-stationary and Varying-discounting MDP (NVMDP) framework, which naturally accommodates non-stationarity and allows discount rates to vary with time and transitions. Infinite-horizon, stationary MDPs emerge as special cases of NVMDPs for identifying an optimal policy, and finite-horizon MDPs are also subsumed within the NVMDP formulation. Moreover, NVMDPs provide a flexible mechanism to shape optimal policies without altering the state space, action space, or reward structure. We establish the theoretical foundations of NVMDPs, including assumptions, state- and action-value formulation and recursion, matrix representation, optimality conditions, and policy improvement under finite state and action spaces. Building on these results, we adapt dynamic programming and generalized Q-learning algorithms to NVMDPs and provide formal convergence proofs. For problems requiring function approximation, we extend the Policy Gradient Theorem and the policy improvement bound in Trust Region Policy Optimization (TRPO), offering proofs in both scalar and matrix forms. Empirical evaluations in a non-stationary gridworld environment demonstrate that NVMDP-based algorithms successfully recover optimal trajectories under multiple reward and discounting schemes, whereas the original Q-learning algorithm fails. Together, these results show that NVMDPs provide a theoretically sound and practically effective framework for reinforcement learning, requiring only minor algorithmic modifications while enabling robust handling of non-stationarity and explicit shaping of optimal policies.
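To make the idea of a varying discount concrete, the following is a minimal sketch, not the paper's formal definition. It assumes the return is accumulated as a running product of per-step discount factors, and it lets the discount depend on the time step only; a transition-dependent version would also pass the state, action, and next state to the discount function. The names `varying_discount_return` and `discount` are hypothetical and chosen for illustration.

```python
from typing import Callable, Sequence


def varying_discount_return(
    rewards: Sequence[float],
    discount: Callable[[int], float],
) -> float:
    """Return sum_t (prod_{k<t} discount(k)) * rewards[t].

    Assumption: `discount(t)` is the factor applied when moving from
    step t to step t + 1. This is an illustrative convention only.
    """
    total = 0.0
    weight = 1.0  # product of the discount factors seen so far
    for t, reward in enumerate(rewards):
        total += weight * reward
        weight *= discount(t)
    return total


rewards = [1.0, 1.0, 1.0, 1.0]

# A constant discount gives the usual discounted return.
constant = varying_discount_return(rewards, lambda t: 0.5)

# A discount that drops to zero after `horizon` steps gives an
# undiscounted finite-horizon return over the first `horizon` rewards.
horizon = 2
finite = varying_discount_return(
    rewards, lambda t: 1.0 if t < horizon - 1 else 0.0
)

print(constant, finite)
```

Under these assumptions, a constant factor gives the familiar discounted sum, while a factor that switches to zero truncates the return at a fixed horizon. This is one informal way to see how a single formulation can cover both cases.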
