Smoothed functional-based gradient algorithms for off-policy reinforcement learning: A non-asymptotic viewpoint

Nithia Vijayan, Prashanth L. A

TL;DR

This work tackles off-policy reinforcement learning by introducing two policy-gradient algorithms that integrate smoothed functional gradient estimation with off-policy evaluation via importance sampling. The plain OffP-SF method achieves a non-asymptotic convergence rate of $O\left(\frac{1}{\sqrt{N}}\right)$, while the SVRG-inspired OffP-SF-SVRG attains a faster $O\left(\frac{1}{N}\right)$ rate, both converging to an $\epsilon$-stationary point. A benchmark OffP-REINFORCE is analyzed for comparison, with all methods equipped with universal-step-size guarantees. The paper also validates the approach on CartPole, showing competitive performance and highlighting the benefit of variance reduction in the off-policy setting. Overall, the results establish that SF-based gradient estimation is a viable alternative to likelihood-ratio-based methods for off-policy policy optimization, with SVRG-style variance reduction yielding superior convergence rates.

Abstract

We propose two policy gradient algorithms for solving the problem of control in an off-policy reinforcement learning (RL) context. Both algorithms incorporate a smoothed functional (SF) based gradient estimation scheme. The first algorithm is a straightforward combination of importance sampling-based off-policy evaluation with SF-based gradient estimation. The second algorithm, inspired by the stochastic variance-reduced gradient (SVRG) algorithm, incorporates variance reduction in the update iteration. For both algorithms, we derive non-asymptotic bounds that establish convergence to an approximate stationary point. From these results, we infer that the first algorithm converges at a rate that is comparable to the well-known REINFORCE algorithm in an off-policy RL context, while the second algorithm exhibits an improved rate of convergence.
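The first algorithm's two ingredients, importance-sampling off-policy evaluation and SF-based gradient estimation, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact estimator: the function names (`is_value_estimate`, `sf_gradient`), the two-sided difference, the unit-sphere perturbations, and the trajectory format are all hypothetical choices made for this example.

```python
import numpy as np


def is_value_estimate(trajectories, target_prob, behavior_prob, gamma=0.99):
    """Importance-sampling estimate of the target policy's discounted return.

    Assumed format: each trajectory is a list of (state, action, reward)
    tuples collected under the behavior policy.
    """
    total = 0.0
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            # Likelihood ratio between target and behavior policies.
            weight *= target_prob(s, a) / behavior_prob(s, a)
            ret += (gamma ** t) * r
        total += weight * ret
    return total / len(trajectories)


def sf_gradient(objective, theta, mu=0.05, n_dirs=64, rng=None):
    """Two-sided smoothed functional gradient estimate of `objective` at `theta`.

    Perturbation directions are drawn uniformly from the unit sphere
    (an assumption of this sketch); `mu` is the smoothing parameter.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = theta.size
    grad = np.zeros(d)
    for _ in range(n_dirs):
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)  # normalise to a point on the unit sphere
        diff = objective(theta + mu * v) - objective(theta - mu * v)
        grad += (d / (2.0 * mu)) * diff * v
    return grad / n_dirs


if __name__ == "__main__":
    # Sanity check on a quadratic whose true gradient at theta is theta itself.
    theta = np.array([1.0, -2.0, 0.5])
    est = sf_gradient(lambda th: 0.5 * th @ th, theta, mu=0.1,
                      n_dirs=5000, rng=np.random.default_rng(0))
    print("SF estimate:", est, "true gradient:", theta)
```

In an off-policy setting, `objective` would be a function of the policy parameter that calls `is_value_estimate` on trajectories logged by the behavior policy, so that only function evaluations of the estimated return are needed and no score function of the policy is required.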

Paper Structure

This paper contains 13 sections, 25 theorems, 108 equations, 1 figure, 1 table, 3 algorithms.

Key Result

Theorem 1

Assume as:pol_cont--as:proper. Let $P_R(k)=\mathbb{P}(R=k)=\frac{\alpha_k }{\sum_{k=0}^{N-1}\alpha_k}$, $\forall N \in \mathbb{N}$, and $J^*=\max_{\theta\in\Theta} J(\theta)$. Then, where $L$ is the Lipschitz constant of $J$ as well as $\nabla J$ (see Lemma lm:J_lip in Section sec:conv below).
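The distribution $P_R$ above puts mass on each index $k \in \{0,\dots,N-1\}$ in proportion to $\alpha_k$. The sketch below shows how such an index could be drawn; it assumes that $\alpha_k$ are step sizes and that $R$ selects which iterate is reported, which is the usual role of a randomized index in non-asymptotic analyses but is not confirmed by the truncated statement here. The function name `sample_output_index` is hypothetical.

```python
import numpy as np


def sample_output_index(step_sizes, rng=None):
    """Draw R with P(R = k) = alpha_k / sum_j alpha_j, for k = 0, ..., N-1.

    Returns the sampled index together with the probability vector.
    """
    rng = np.random.default_rng() if rng is None else rng
    alphas = np.asarray(step_sizes, dtype=float)
    probs = alphas / alphas.sum()
    return int(rng.choice(len(alphas), p=probs)), probs


if __name__ == "__main__":
    # With a constant step size the distribution is uniform over the iterates.
    N = 5
    R, probs = sample_output_index([0.1] * N, rng=np.random.default_rng(0))
    print("sampled index:", R, "probabilities:", probs)
```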

Figures (1)

  • Figure 1: CartPole with fixed initial state

Theorems & Definitions (56)

  • Definition 1
  • Theorem 1: OffP-SF
  • proof
  • Corollary 1: OffP-SF
  • proof
  • Remark 1
  • Remark 2
  • Theorem 2: OffP-SF-SVRG
  • proof
  • Remark 3
  • ...and 46 more