Efficiently Escaping Saddle Points for Policy Optimization

Sadegh Khorasani; Saber Salehkaleybar; Negar Kiyavash; Niao He; Matthias Grossglauser

TL;DR

This work tackles non-convex policy optimization in reinforcement learning by introducing VR-SCP, a variance-reduced cubic-regularized Newton method that uses Hessian-vector products to incorporate second-order information without relying on importance sampling. The algorithm reaches an approximate second-order stationary point, an $(\epsilon,\sqrt{\rho\epsilon})$-SOSP, with a sample complexity of $\tilde{O}(\epsilon^{-3})$, improving the best-known rate by a factor of $\epsilon^{-0.5}$. Key ideas include a Hessian-aided variance reduction technique that bypasses IS weights, high-probability error control for gradient and Hessian estimates, and a cubic-subsolver framework that ensures progress toward an SOSP. Empirically, VR-SCP outperforms state-of-the-art variance-reduced policy gradient methods and is robust to random seeds across multiple MuJoCo control tasks, supporting both its theoretical guarantees and its practical impact.
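
To illustrate the cubic-subsolver idea, the following is a minimal sketch and not the paper's subsolver. It assumes a minimization convention (policy optimization maximizes return, so signs would flip), the standard cubic model $m(h) = g^\top h + \frac{1}{2} h^\top B h + \frac{M}{6}\|h\|^3$, and plain gradient descent that touches the Hessian only through a Hessian-vector product callback. All function and parameter names (`solve_cubic_subproblem`, `hvp`, `M`, `h0`) are hypothetical.

```python
import math

def cubic_model_value(g, hvp, h, M):
    # m(h) = g.h + 0.5 * h.Bh + (M / 6) * ||h||^3, with Bh supplied by hvp(h).
    norm_h = math.sqrt(sum(x * x for x in h))
    Bh = hvp(h)
    linear = sum(gi * hi for gi, hi in zip(g, h))
    quadratic = 0.5 * sum(hi * bi for hi, bi in zip(h, Bh))
    return linear + quadratic + (M / 6.0) * norm_h ** 3

def cubic_model_grad(g, hvp, h, M):
    # Gradient of the cubic model: g + Bh + (M / 2) * ||h|| * h.
    norm_h = math.sqrt(sum(x * x for x in h))
    Bh = hvp(h)
    return [gi + bi + 0.5 * M * norm_h * hi for gi, bi, hi in zip(g, Bh, h)]

def solve_cubic_subproblem(g, hvp, M, h0, step=0.05, iters=500):
    # Plain gradient descent on the cubic model (an assumed, simple subsolver).
    h = list(h0)
    for _ in range(iters):
        grad_m = cubic_model_grad(g, hvp, h, M)
        h = [hi - step * gm for hi, gm in zip(h, grad_m)]
    return h

# Toy saddle point of f(x, y) = x^2 - y^2 at the origin:
# the gradient is zero and the Hessian is diag(2, -2).
g_saddle = [0.0, 0.0]
hvp_saddle = lambda v: [2.0 * v[0], -2.0 * v[1]]

# A small nonzero start is needed because h = 0 is a stationary point of the model.
h_star = solve_cubic_subproblem(g_saddle, hvp_saddle, M=4.0, h0=[1e-3, 1e-3])
```

In this toy case the gradient carries no information at the saddle, yet the returned step moves along the negative-curvature direction `y` and lowers the model value, which is the mechanism that lets a cubic-regularized method leave a saddle point.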

Abstract

Policy gradient (PG) is widely used in reinforcement learning due to its scalability and good performance. In recent years, several variance-reduced PG methods have been proposed with a theoretical guarantee of converging to an approximate first-order stationary point (FOSP) with a sample complexity of $O(\epsilon^{-3})$. However, FOSPs can be bad local optima or saddle points. Moreover, these algorithms often use importance sampling (IS) weights, which can impair the statistical effectiveness of variance reduction. In this paper, we propose a variance-reduced second-order method that uses second-order information in the form of Hessian-vector products (HVP) and converges to an approximate second-order stationary point (SOSP) with a sample complexity of $\tilde{O}(\epsilon^{-3})$. This rate improves the best-known sample complexity for achieving approximate SOSPs by a factor of $O(\epsilon^{-0.5})$. Moreover, the proposed variance reduction technique bypasses IS weights by using HVP terms. Our experimental results show that the proposed algorithm outperforms the state of the art and is more robust to changes in random seeds.
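
The idea of replacing IS weights with HVP terms can be sketched on a toy problem. The identity used is $\nabla f(\theta_t) - \nabla f(\theta_{t-1}) = \int_0^1 \nabla^2 f(\theta_{t-1} + a\,\Delta)\,\Delta\,da$ with $\Delta = \theta_t - \theta_{t-1}$, so the gradient difference can be estimated by averaging HVPs at random points on the segment between the two iterates. The code below is an illustration under assumptions, not the paper's estimator: it uses a deterministic toy gradient instead of trajectory-based estimates, approximates HVPs by central finite differences, and all names (`grad`, `hvp`, `hessian_aided_difference`) are hypothetical.

```python
import random

def grad(theta):
    # Gradient of the toy objective f(x, y) = x^2 - y^2 + 0.25 * y^4 (an assumption).
    x, y = theta
    return [2.0 * x, -2.0 * y + y ** 3]

def hvp(grad_fn, theta, v, eps=1e-5):
    # Hessian-vector product via a central finite difference of the gradient;
    # no Hessian matrix is ever formed.
    plus = grad_fn([t + eps * vi for t, vi in zip(theta, v)])
    minus = grad_fn([t - eps * vi for t, vi in zip(theta, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]

def hessian_aided_difference(grad_fn, theta_prev, theta_curr, num_samples=64, seed=0):
    # Monte Carlo estimate of grad(theta_curr) - grad(theta_prev):
    # average HVPs along the segment, with a ~ Uniform[0, 1].
    rng = random.Random(seed)
    delta = [c - p for c, p in zip(theta_curr, theta_prev)]
    total = [0.0] * len(delta)
    for _ in range(num_samples):
        a = rng.random()
        point = [p + a * d for p, d in zip(theta_prev, delta)]
        h = hvp(grad_fn, point, delta)
        total = [t + hi for t, hi in zip(total, h)]
    return [t / num_samples for t in total]

theta_prev, theta_curr = [1.0, 0.5], [1.2, 0.7]
estimate = hessian_aided_difference(grad, theta_prev, theta_curr)
exact = [c - p for c, p in zip(grad(theta_curr), grad(theta_prev))]
```

Because the correction term is built from points between the two iterates rather than from samples reweighted across two policies, no importance weight appears anywhere in the estimate.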

Paper Structure (20 sections, 15 theorems, 69 equations, 2 figures, 2 tables, 3 algorithms)

Introduction
Preliminaries
Notations and problem definition
Variance reduced methods for gradient estimation
Stochastic Cubic Regularized Newton
VR-SCP Algorithm
Description
Convergence Analysis
Proof sketch.
Related Work
Experiments
Conclusion
Auxiliary Lemmas
Convergence Analysis
Proof of Lemma \ref{lem:grad_bound}
...and 5 more sections

Key Result

Lemma 3.3

\cite{shen2019hessian, wang2022stochastic} Under Assumptions \ref{assum:1} and \ref{assum:2}, for any $\theta_1, \theta_2, w \in \mathbb{R}^d$ and for any trajectory $\tau$, there exist constants $W$, $L$ and $\rho$ such that: […] where $\rho = \frac{L_2 R_0 + 2 R_0 G H L_1}{(1-\gamma)^2}$, $L = \frac{L_1 G R_0}{(1-\gamma)^2}$, and $W = \frac{G R_0}{(1-\gamma)^2}$.
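
As a quick arithmetic aid, the three constants in Lemma 3.3 can be evaluated directly from the stated formulas. The sketch below assumes that $G$, $L_1$, $L_2$, $R_0$, $H$ and $\gamma$ are the quantities fixed by the two assumptions the lemma relies on; this excerpt does not define them, so the parameter names and the helper `lemma_constants` are hypothetical.

```python
def lemma_constants(G, L1, L2, R0, H, gamma):
    # Evaluate W, L and rho exactly as written in Lemma 3.3.
    # The meaning of each input is an assumption; the excerpt does not define them.
    denom = (1.0 - gamma) ** 2
    W = G * R0 / denom
    L = L1 * G * R0 / denom
    rho = (L2 * R0 + 2.0 * R0 * G * H * L1) / denom
    return W, L, rho

# Illustrative inputs only, not values from the paper.
W, L, rho = lemma_constants(G=1.0, L1=1.0, L2=1.0, R0=1.0, H=1.0, gamma=0.5)
```

All three constants share the factor $(1-\gamma)^{-2}$, so they grow quickly as the discount factor approaches 1.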

Figures (2)

Figure 1: Each configuration is evaluated with five different random seeds.
Figure 2: Comparison of VR-SCP with other variance reduction methods on four control tasks.

Theorems & Definitions (17)

Lemma 3.3
Remark 3.4
Lemma 3.5
Definition 3.6
Lemma 3.7
Lemma 3.8
Theorem 3.9
Corollary 3.10
Lemma A.1
Lemma A.2
...and 7 more
