Natural Policy Gradient for Average Reward Non-Stationary RL

Neharika Jali, Eshika Pathak, Pranay Sharma, Guannan Qu, Gauri Joshi

TL;DR

This work addresses non-stationary reinforcement learning in the infinite-horizon average-reward setting, modeling dynamics with time-varying rewards and transitions under a variation budget $\Delta_T$. It introduces NS-NAC, a model-free policy-gradient algorithm with restart-based exploration and two-timescale updates, and a parameter-free BORL-NS-NAC that tunes hyperparameters via a bandit-over-RL framework. The authors establish a dynamic regret bound of $\tilde{O}(|S|^{1/2}|A|^{1/2}\Delta_T^{1/6}T^{5/6})$ and provide a detailed regret-decomposition analysis that handles non-stationarity through Lyapunov-function-based techniques and auxiliary Markov chains. Empirical results on synthetic non-stationary MDPs show that NS-NAC and BORL-NS-NAC achieve sublinear dynamic regret, demonstrating the viability of model-free policy-based methods in continual non-stationary RL scenarios.

Abstract

We consider the problem of non-stationary reinforcement learning (RL) in the infinite-horizon average-reward setting. We model it by a Markov Decision Process with time-varying rewards and transition probabilities, with a variation budget of $\Delta_T$. Existing non-stationary RL algorithms focus on model-based and model-free value-based methods. Policy-based methods, despite their flexibility in practice, are not theoretically well understood in non-stationary RL. We propose and analyze the first model-free policy-based algorithm, Non-Stationary Natural Actor-Critic (NS-NAC), a policy gradient method with restart-based exploration for change and a novel interpretation of learning rates as adapting factors. Further, we present a bandit-over-RL-based parameter-free algorithm, BORL-NS-NAC, that does not require prior knowledge of the variation budget $\Delta_T$. We present a dynamic regret of $\tilde{\mathcal{O}}(|S|^{1/2}|A|^{1/2}\Delta_T^{1/6}T^{5/6})$ for both algorithms, where $T$ is the time horizon, and $|S|$, $|A|$ are the sizes of the state and action spaces. The regret analysis leverages a novel adaptation of the Lyapunov function analysis of NAC to dynamic environments and characterizes the effects of simultaneous updates in policy, value function estimate, and changes in the environment.
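To get a feel for how the stated regret rate scales, the following minimal sketch evaluates only the leading-order term $|S|^{1/2}|A|^{1/2}\Delta_T^{1/6}T^{5/6}$. The function names and the example values are hypothetical, and the constants and logarithmic factors hidden by the $\tilde{\mathcal{O}}$ notation are dropped, so only ratios between outputs are meaningful.

```python
import math


def regret_bound(num_states, num_actions, variation_budget, horizon):
    """Leading-order term |S|^(1/2) |A|^(1/2) Delta_T^(1/6) T^(5/6).

    Assumption: constants and log factors are ignored, so this is a
    scaling illustration, not the bound proved in the paper.
    """
    return (
        math.sqrt(num_states * num_actions)
        * variation_budget ** (1 / 6)
        * horizon ** (5 / 6)
    )


def per_step_regret_bound(num_states, num_actions, variation_budget, horizon):
    """Bound divided by T; it shrinks with T whenever Delta_T grows slower than T."""
    return regret_bound(num_states, num_actions, variation_budget, horizon) / horizon


# Hypothetical example: 10 states, 4 actions, budget growing like sqrt(T).
for T in (10**4, 10**6, 10**8):
    delta = math.sqrt(T)
    print(T, per_step_regret_bound(10, 4, delta, T))
```

With $\Delta_T \propto T^{1/2}$ the leading term grows as $T^{11/12}$, so the per-step value printed above decreases with $T$; with $\Delta_T \propto T$ the term becomes linear in $T$ and the bound is no longer sublinear.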

Paper Structure

This paper contains 46 sections, 34 theorems, 150 equations, 5 figures, 1 table, 2 algorithms.

Key Result

Lemma 5.2

Under the ergodicity assumption, for all potential policies $\bm{\pi}_t$ in all environments $\mathbf{P}_t$, $t \in [T]$, the matrix $\bar{\mathbf{A}}^{\bm{\pi}_t, \mathbf{P}_t}$ is negative semi-definite. Define its maximum non-zero eigenvalue as $-\lambda$.
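This excerpt does not define the matrix $\bar{\mathbf{A}}$. As a rough numerical illustration, the sketch below assumes a form common in average-reward TD analysis, $D(P - I)$, where $P$ is the transition matrix of an ergodic chain and $D$ is the diagonal matrix of its stationary distribution. That form, the state-level chain, and all helper names are assumptions, not the paper's construction. The check is that $x^\top D(P - I)x \le 0$ for random vectors $x$, with equality for constant vectors.

```python
import random


def random_ergodic_chain(n, rng):
    """Row-stochastic matrix with strictly positive entries (hence ergodic)."""
    rows = []
    for _ in range(n):
        weights = [rng.random() + 0.1 for _ in range(n)]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


def stationary_distribution(P, iters=2000):
    """Approximate the stationary distribution by power iteration (d <- d P)."""
    n = len(P)
    d = [1.0 / n] * n
    for _ in range(iters):
        d = [sum(d[i] * P[i][j] for i in range(n)) for j in range(n)]
    return d


def quadratic_form(P, d, x):
    """Compute x^T D (P - I) x with D = diag(d). Assumed form of the matrix."""
    n = len(P)
    return sum(
        d[i] * x[i] * (sum(P[i][j] * x[j] for j in range(n)) - x[i])
        for i in range(n)
    )


rng = random.Random(0)
P = random_ergodic_chain(5, rng)
d = stationary_distribution(P)
worst = max(
    quadratic_form(P, d, [rng.uniform(-1, 1) for _ in range(5)])
    for _ in range(1000)
)
print("largest quadratic form over random x:", worst)
```

Under this assumed form, the constant vector lies in the null space, which is why the lemma refers to the maximum non-zero eigenvalue.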

Figures (5)

  • Figure 1: Performance of NS-NAC and baseline algorithms across various settings. (a) Dynamic regret for a single instance with $T = 25\times10^4$ steps. Log-log plots showing the effect of varying: (b) time horizon $T$, and (c) variation budget $\Delta_T$.
  • Figure 2: Log-log plots showing the effect of varying: (a) number of states $|{\mathcal{S}}|$, and (b) number of actions $|{\mathcal{A}}|$.
  • Figure 3: Performance of NS-NAC with different step-sizes in an environment with 17 abrupt, randomly scheduled switches over $T = 4 \times 10^3$ steps.
  • Figure 4: Performance of NS-NAC and baseline algorithms in various non-stationary settings. (a) Dynamic regret for a single instance over $T = 1\times10^4$ steps in an environment with 50 abrupt, randomly scheduled switches. (b) Dynamic regret for a single instance over $T = 1\times10^4$ steps in an environment with small, continuous changes.
  • Figure : Non-Stationary Natural Actor-Critic (NS-NAC)

Theorems & Definitions (64)

  • Lemma 5.2: zhang2021finite, Lemma 2
  • Theorem 5.3
  • Theorem 5.4: mao2021nearOptimal, Proposition 1
  • Theorem 6.1
  • Theorem 4.1
  • proof
  • Proposition 4.2
  • proof
  • Proposition 4.3
  • proof
  • ...and 54 more