Smoothed functional-based gradient algorithms for off-policy reinforcement learning: A non-asymptotic viewpoint
Nithia Vijayan, Prashanth L. A
TL;DR
This work addresses off-policy reinforcement learning with two policy-gradient algorithms that combine smoothed functional (SF) gradient estimation with off-policy evaluation via importance sampling. The basic method, OffP-SF, achieves a non-asymptotic convergence rate of $O\left(\frac{1}{\sqrt{N}}\right)$, while the SVRG-inspired variant, OffP-SF-SVRG, attains a faster $O\left(\frac{1}{N}\right)$ rate; both converge to an $\epsilon$-stationary point. An off-policy REINFORCE baseline, OffP-REINFORCE, is analyzed for comparison, and all methods come with universal-step-size guarantees. Experiments on CartPole show competitive performance and highlight the benefit of variance reduction in the off-policy setting. Overall, the results establish SF-based gradient estimation as a viable alternative to likelihood-ratio-based methods for off-policy policy optimization, with SVRG-style variance reduction yielding a better convergence rate.
Abstract
We propose two policy gradient algorithms for solving the problem of control in an off-policy reinforcement learning (RL) context. Both algorithms incorporate a smoothed functional (SF) based gradient estimation scheme. The first algorithm is a straightforward combination of importance sampling-based off-policy evaluation with SF-based gradient estimation. The second algorithm, inspired by the stochastic variance-reduced gradient (SVRG) algorithm, incorporates variance reduction in the update iteration. For both algorithms, we derive non-asymptotic bounds that establish convergence to an approximate stationary point. From these results, we infer that the first algorithm converges at a rate that is comparable to the well-known REINFORCE algorithm in an off-policy RL context, while the second algorithm exhibits an improved rate of convergence.
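To make the first algorithm's two ingredients concrete, the following is a minimal sketch, not the authors' implementation. It estimates the value of a target policy from episodes generated by a behavior policy using trajectory-level importance weights, then forms a two-sided SF gradient estimate along a random unit direction. The toy two-state MDP (`P`, `R`), the tabular softmax parameterization, the two-sided difference, and all constants (`mu`, step size, batch size) are assumptions chosen for illustration, as is drawing a fresh batch at every iteration.

```python
import numpy as np

def softmax_policy(theta, n_states, n_actions):
    """Tabular softmax policy: returns an (n_states, n_actions) probability table."""
    logits = theta.reshape(n_states, n_actions)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def sample_behavior_episodes(rng, P, R, behavior, horizon, n_episodes):
    """Roll out the behavior policy from state 0; arrays have shape (n_episodes, horizon)."""
    n_states, n_actions = behavior.shape
    S = np.zeros((n_episodes, horizon), dtype=int)
    A = np.zeros((n_episodes, horizon), dtype=int)
    Rew = np.zeros((n_episodes, horizon))
    for i in range(n_episodes):
        s = 0
        for t in range(horizon):
            a = rng.choice(n_actions, p=behavior[s])
            S[i, t], A[i, t], Rew[i, t] = s, a, R[s, a]
            s = rng.choice(n_states, p=P[s, a])
    return S, A, Rew

def is_value_estimate(theta, episodes, behavior, gamma):
    """Importance-sampling estimate of the target policy's discounted return."""
    S, A, Rew = episodes
    pi = softmax_policy(theta, *behavior.shape)
    # Product over time of target/behavior action probabilities, one weight per episode.
    ratios = np.prod(pi[S, A] / behavior[S, A], axis=1)
    returns = Rew @ (gamma ** np.arange(Rew.shape[1]))
    return float(np.mean(ratios * returns))

def sf_gradient(theta, episodes, behavior, gamma, mu, rng):
    """Two-sided SF gradient estimate along a direction drawn uniformly from the unit sphere."""
    d = theta.size
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    j_plus = is_value_estimate(theta + mu * v, episodes, behavior, gamma)
    j_minus = is_value_estimate(theta - mu * v, episodes, behavior, gamma)
    return d * (j_plus - j_minus) / (2.0 * mu) * v

# Hypothetical toy problem: 2 states, 2 actions, uniform behavior policy.
rng = np.random.default_rng(0)
n_states, n_actions, horizon, gamma = 2, 2, 5, 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])  # P[s, a, s']
R = np.array([[0.0, 1.0], [0.5, 2.0]])    # R[s, a]
behavior = np.full((n_states, n_actions), 0.5)

theta = np.zeros(n_states * n_actions)
for _ in range(100):
    batch = sample_behavior_episodes(rng, P, R, behavior, horizon, n_episodes=50)
    # Gradient ascent on the estimated return; only behavior-policy data is used.
    theta = theta + 0.05 * sf_gradient(theta, batch, behavior, gamma, mu=0.1, rng=rng)
```

The target policy is never executed in this sketch: each perturbed parameter is evaluated by reweighting the same behavior-policy batch, which is what allows the SF estimate to be formed off-policy. The variance of the product of importance ratios grows with the horizon, which is one motivation for the variance-reduced second algorithm.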
