Score-Aware Policy-Gradient and Performance Guarantees using Local Lyapunov Stability

Céline Comte, Matthieu Jonckheere, Jaron Sanders, Albert Senen-Cerda

TL;DR

This work introduces Score-Aware Gradient Estimators (SAGE) for policy-gradient learning in model-based RL when the stationary distribution of the MDP lies in an exponential family linked to policy parameters. By leveraging the score of the exponential family, SAGE yields a gradient estimator that avoids value-function estimation, enabling efficient updates even on countably infinite state spaces and in the presence of unstable policies. The authors prove local convergence and derive regret bounds under a local Lyapunov framework, with an entropy-regularized variant ensuring bounded maxima. Numerical experiments across admission-control, load-balancing, and Ising-Glauber dynamics show SAGE can outperform actor-critic, particularly when the system structure yields a low-dimensional sufficient statistic. The results highlight the practical value of incorporating model-specific stationary-distribution information into gradient-based learning for complex stochastic systems.

Abstract

In this paper, we introduce a policy-gradient method for model-based reinforcement learning (RL) that exploits a class of stationary distributions commonly obtained from Markov decision processes (MDPs) in stochastic networks, queueing systems, and statistical mechanics. Specifically, when the stationary distribution of the MDP belongs to an exponential family that is parametrized by policy parameters, we can improve existing policy-gradient methods for average-reward RL. Our key contribution is a family of gradient estimators, called score-aware gradient estimators (SAGEs), that enable policy-gradient estimation without relying on value-function estimation in the aforementioned setting. We show that SAGE-based policy gradient converges locally, and we derive a bound on its regret. This includes cases where the state space of the MDP is countable and unstable policies can exist. Under appropriate assumptions, such as starting sufficiently close to a maximizer and the existence of a local Lyapunov function, the policy under SAGE-based stochastic gradient ascent has an overwhelming probability of converging to the associated optimal policy. Furthermore, we conduct a numerical comparison between a SAGE-based policy-gradient method and an actor-critic method on several examples inspired by stochastic networks, queueing systems, and models derived from statistical physics. Our results demonstrate that a SAGE-based method finds close-to-optimal policies faster than an actor-critic method.
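To make the idea of estimating a policy gradient without a value function concrete, here is a minimal sketch in plain Python. It assumes the gradient splits into a covariance term between the reward and the sufficient statistic, premultiplied by the transposed Jacobian of the natural parameter, plus the usual score term of the policy. The function name `sage_gradient`, the sample layout, and the argument `jac_eta` are hypothetical and chosen for illustration; this is not the authors' algorithm, which may batch and update differently.

```python
def sage_gradient(samples, jac_eta):
    """Plug-in estimate of a score-aware policy gradient (illustrative sketch).

    samples : list of (x, r, g) tuples, where
              x is the sufficient statistic of the visited state (d floats),
              r is the observed reward (float),
              g is grad_theta log pi(a | s, theta) for the action taken (n floats).
    jac_eta : d x n nested list, the Jacobian of the natural parameter
              with respect to theta (an assumed, model-specific input).
    Returns a list of n floats approximating grad J(theta).
    """
    m = len(samples)
    d = len(jac_eta)
    n = len(jac_eta[0])

    # Empirical means of the reward and of each sufficient-statistic coordinate.
    mean_r = sum(r for _, r, _ in samples) / m
    mean_x = [sum(x[i] for x, _, _ in samples) / m for i in range(d)]

    # Empirical covariance between the reward and each coordinate of x(S).
    cov = [
        sum((r - mean_r) * (x[i] - mean_x[i]) for x, r, _ in samples) / m
        for i in range(d)
    ]

    # Empirical mean of R * grad log pi(A | S, theta).
    policy_term = [sum(r * g[j] for _, r, g in samples) / m for j in range(n)]

    # Combine: jac_eta^T @ cov + policy_term.
    return [
        sum(jac_eta[i][j] * cov[i] for i in range(d)) + policy_term[j]
        for j in range(n)
    ]


if __name__ == "__main__":
    # Two hand-made samples with a one-dimensional statistic and parameter.
    batch = [([0.0], 1.0, [1.0]), ([2.0], 3.0, [-1.0])]
    print(sage_gradient(batch, [[0.5]]))
```

In this sketch no value function or critic appears: the only model-specific inputs are the sufficient statistic of each visited state and the Jacobian of the natural parameter, which is where a low-dimensional sufficient statistic would pay off.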

Paper Structure (77 sections, 22 theorems, 233 equations, 5 figures, 1 table, 4 algorithms)

Key Result

Theorem 1

Suppose that Assumptions \ref{ass:markov}, \ref{ass:reward}, and \ref{ass:stat} hold. For each $\theta \in \Omega$, we have […] where $(S, A, R) \sim$ \ref{eq:stat}, $\mathrm{Cov}[R, x(S)] = (\mathrm{Cov}[R, x_1(S)], \ldots, \mathrm{Cov}[R, x_d(S)])^\intercal$, and the gradient and Jacobian operators, $\nabla$ and …
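
The displayed identity is missing from the statement above. As a hedged illustration of why a covariance with $x(S)$ appears, the following derivation sketches the standard exponential-family computation. The symbols $\Phi$, $\eta$, and $Z$ are notation assumed here, not taken from the text, and the sketch further assumes that the reward distribution given $(S, A)$ does not depend on $\theta$ and that differentiation and summation can be interchanged. It may differ in form from the theorem's actual statement.

```latex
\begin{align*}
p(s \mid \theta)
  &= \frac{\Phi(s)\, \exp\bigl(\eta(\theta)^\intercal x(s)\bigr)}{Z(\theta)}, \\
\nabla \log p(s \mid \theta)
  &= \mathrm{D}\eta(\theta)^\intercal \bigl( x(s) - \mathbb{E}[x(S)] \bigr), \\
\nabla J(\theta)
  &= \mathbb{E}\bigl[ R\, \nabla \log p(S \mid \theta) \bigr]
   + \mathbb{E}\bigl[ R\, \nabla \log \pi(A \mid S, \theta) \bigr] \\
  &= \mathrm{D}\eta(\theta)^\intercal\, \mathrm{Cov}[R, x(S)]
   + \mathbb{E}\bigl[ R\, \nabla \log \pi(A \mid S, \theta) \bigr].
\end{align*}
```

The second line uses $\nabla \log Z(\theta) = \mathrm{D}\eta(\theta)^\intercal\, \mathbb{E}[x(S)]$, and the last line uses $\mathbb{E}[R\,(x(S) - \mathbb{E}[x(S)])] = \mathrm{Cov}[R, x(S)]$. Neither term requires a value function.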

Figures (5)

  • Figure 1: Long-run average reward $J(\Theta_t)$ in the admission-control problem with $\lambda = 0.7$, $\mu = 1$, $\gamma = 5$, and $\eta = 1$. Using \ref{app:mm1}, we can verify that the long-run average reward under the best policy is approximately 2.183 if $k = 0$, 2.566 if $k = 1$, and 2.795 if $k \ge 3$.
  • Figure 2: Admission probabilities under policy parametrization $\pi_3$.
  • Figure 3: Long-run average reward in the admission-control problem with parameters $\lambda = 1.4$, $\mu = 1$, $\gamma = 5$, and $\eta = 1$. Using \ref{app:mm1}, we can verify that the maximal value of the long-run average reward is approximately 1.091 if $k = 0$ and 1.880 if $k \ge 2$.
  • Figure 4: Impact of the number of servers and service-rate imbalance on the performance of SAGE and actor-critic in a load-balancing system. Solid lines show the long-run average reward $J(\Theta_t)$, while dashed lines show the running average reward, $\frac{1}{t} \sum_{t' = 1}^t R_{t'}$. Simulations for $n = 100$ and $\delta = 4$ are omitted because numerical instability of Buzen's algorithm (see \ref{app:lb}) prevents us from computing $J(\Theta_t)$ in this case.
  • Figure 5: Performance of SAGE in the Ising model.

Theorems & Definitions (24)

  • Theorem 1
  • Theorem 2: Noncompact Case
  • Corollary 3: Sample Complexity
  • Proposition 4
  • Proposition 5: Performance Gap
  • Corollary 6: Regret
  • Proposition 7
  • Lemma 8
  • Proposition 9
  • Lemma 10
  • ...and 14 more