Natural Policy Gradient for Average Reward Non-Stationary RL
Neharika Jali, Eshika Pathak, Pranay Sharma, Guannan Qu, Gauri Joshi
TL;DR
This work addresses non-stationary reinforcement learning in the infinite-horizon average-reward setting, modeling dynamics with time-varying rewards and transitions under a variation budget $\Delta_T$. It introduces NS-NAC, a model-free policy-gradient algorithm with restart-based exploration and two-timescale updates, and a parameter-free variant, BORL-NS-NAC, that tunes hyperparameters via a bandit-over-RL framework without prior knowledge of $\Delta_T$. The authors establish a dynamic regret bound of $\tilde{O}(|S|^{1/2}|A|^{1/2}\Delta_T^{1/6}T^{5/6})$ and provide a detailed regret-decomposition analysis that handles non-stationarity through Lyapunov-function-based techniques and auxiliary Markov chains. Empirical results on synthetic non-stationary MDPs show that NS-NAC and BORL-NS-NAC achieve sublinear dynamic regret, demonstrating the viability of model-free policy-based methods in continual non-stationary RL scenarios.
Abstract
We consider the problem of non-stationary reinforcement learning (RL) in the infinite-horizon average-reward setting. We model it as a Markov Decision Process with time-varying rewards and transition probabilities, subject to a variation budget of $\Delta_T$. Existing non-stationary RL algorithms focus on model-based and model-free value-based methods. Policy-based methods, despite their flexibility in practice, are not theoretically well understood in non-stationary RL. We propose and analyze the first model-free policy-based algorithm, Non-Stationary Natural Actor-Critic (NS-NAC), a policy gradient method with restart-based exploration for change and a novel interpretation of learning rates as adapting factors. Further, we present a bandit-over-RL-based parameter-free algorithm, BORL-NS-NAC, that does not require prior knowledge of the variation budget $\Delta_T$. We establish a dynamic regret bound of $\tilde{\mathscr O}(|S|^{1/2}|A|^{1/2}\Delta_T^{1/6}T^{5/6})$ for both algorithms, where $T$ is the time horizon and $|S|$, $|A|$ are the sizes of the state and action spaces. The regret analysis leverages a novel adaptation of the Lyapunov function analysis of NAC to dynamic environments and characterizes the effects of simultaneous updates to the policy and value function estimate and changes in the environment.
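As a reading aid, the stated regret rate can be rearranged to show when it is sublinear in $T$. The short calculation below is a sketch under assumptions not taken from the paper: $R_T$ is a hypothetical symbol for the dynamic regret, $|S|$ and $|A|$ are treated as fixed, logarithmic factors are ignored, and $\Delta_T$ is assumed positive.

```latex
% R_T is a hypothetical name for the dynamic regret; log factors are suppressed.
\[
  \Delta_T^{1/6}\, T^{5/6} \;=\; T \left(\frac{\Delta_T}{T}\right)^{1/6}
  \quad\Longrightarrow\quad
  \frac{R_T}{T} \;=\; \tilde{O}\!\left(|S|^{1/2}|A|^{1/2}\left(\frac{\Delta_T}{T}\right)^{1/6}\right).
\]
% Illustrative growth rates for the variation budget (assumed, not from the paper):
\[
  \Delta_T = \Theta(1) \;\Rightarrow\; R_T = \tilde{O}\big(T^{5/6}\big), \qquad
  \Delta_T = \Theta\big(T^{1/2}\big) \;\Rightarrow\; R_T = \tilde{O}\big(T^{11/12}\big), \qquad
  \Delta_T = \Theta(T) \;\Rightarrow\; R_T = \tilde{O}(T).
\]
```

Read this way, the bound implies vanishing average regret whenever the variation budget grows more slowly than the horizon, that is, $\Delta_T = o(T)$, and it becomes linear once the environment is allowed to change by a constant amount per step.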
