Policy Gradient Converges to the Globally Optimal Policy for Nearly Linear-Quadratic Regulators

Yinbin Han; Meisam Razaviyayn; Renyuan Xu

TL;DR

This work studies learning optimal control for systems that are linear plus a small nonlinear kernel term, under the policy class $u_t = -K_1 x_t - K_2 \phi(x_t)$. It develops a zeroth-order policy gradient method with a carefully designed initialization based on a linear-quadratic surrogate and least-squares parameter recovery, and proves that the cost is locally strongly convex and smooth in a neighborhood containing the global optimum $K^*$. The authors establish finite-sample guarantees: exact recovery of the system parameters via least squares, local landscape properties, and convergence of the zeroth-order algorithm to $K^*$ at a linear rate when the nonlinearities have small Lipschitz constants. They validate the approach with numerical experiments on synthetic near-LQR systems, demonstrating fast convergence, robustness to initialization, and resilience to moderate nonlinearities. Overall, the paper provides a rigorous route to globally optimal policy learning for nearly linear-quadratic regulators in model-free settings, with insights into initialization, sample efficiency, and the role of kernel-based nonlinearities.

Abstract

Nonlinear control systems with partial information to the decision maker are prevalent in a variety of applications. As a step toward studying such nonlinear systems, this work explores reinforcement learning methods for finding the optimal policy in nearly linear-quadratic regulator systems. In particular, we consider a dynamic system that combines linear and nonlinear components and is governed by a policy with the same structure. Assuming that the nonlinear component comprises kernels with small Lipschitz coefficients, we characterize the optimization landscape of the cost function. Although the cost function is nonconvex in general, we establish local strong convexity and smoothness in the vicinity of the global optimizer. Additionally, we propose an initialization mechanism to leverage these properties. Building on these developments, we design a policy gradient algorithm that is guaranteed to converge to the globally optimal policy at a linear rate.
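To make the zeroth-order policy gradient idea concrete, the following is a minimal sketch rather than the paper's algorithm. It assumes dynamics of the form x_{t+1} = A x_t + B u_t + C phi(x_t) with the policy u_t = -K_1 x_t - K_2 phi(x_t), a truncated-horizon quadratic cost, a two-point smoothed gradient estimator (a common variant; the paper's estimator may differ), and a zero initialization in place of the paper's initialization mechanism. All matrices, the feature map phi, and the helper names make_cost, zeroth_order_grad, and run_demo are hypothetical choices for illustration.

```python
import numpy as np

def make_cost(A, B, C, Q, R, phi, x0s, horizon=20):
    """Truncated-horizon quadratic cost of the policy u = -K1 x - K2 phi(x)."""
    n = A.shape[0]
    def cost(K):
        K1, K2 = K[:, :n], K[:, n:]
        total = 0.0
        for x0 in x0s:
            x = np.array(x0, dtype=float)
            for _ in range(horizon):
                f = phi(x)
                u = -K1 @ x - K2 @ f
                total += x @ Q @ x + u @ R @ u
                # Assumed near-linear dynamics: linear part plus small kernel term.
                x = A @ x + B @ u + C @ f
        return total / len(x0s)
    return cost

def zeroth_order_grad(cost, K, radius, num_dirs, rng):
    """Two-point estimator: uses only cost evaluations, no model derivatives."""
    grad = np.zeros_like(K)
    for _ in range(num_dirs):
        U = rng.standard_normal(K.shape)
        U /= np.linalg.norm(U)  # uniform direction on the unit sphere
        slope = (cost(K + radius * U) - cost(K - radius * U)) / (2.0 * radius)
        grad += K.size * slope * U
    return grad / num_dirs

def run_demo(steps=60, lr=0.01, radius=0.05, num_dirs=10, seed=0):
    rng = np.random.default_rng(seed)
    # Hypothetical stable system with a small nonlinear term (illustration only).
    A = np.array([[0.6, 0.1], [0.0, 0.5]])
    B = np.array([[0.0], [1.0]])
    C = 0.05 * np.ones((2, 1))
    Q, R = np.eye(2), np.eye(1)
    phi = lambda x: np.array([np.tanh(x[0])])
    x0s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    cost = make_cost(A, B, C, Q, R, phi, x0s)
    K = np.zeros((1, 3))  # K = [K1, K2]; zero start is a simplification
    costs = [cost(K)]
    for _ in range(steps):
        K = K - lr * zeroth_order_grad(cost, K, radius, num_dirs, rng)
        costs.append(cost(K))
    return K, costs

if __name__ == "__main__":
    K, costs = run_demo()
    print("initial cost:", costs[0], "final cost:", costs[-1])
```

The estimator queries the cost only at perturbed policies, which is what makes the approach model-free. The step size, smoothing radius, and number of directions here are untuned illustrative values.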

Paper Structure (20 sections, 15 theorems, 147 equations, 2 figures, 2 algorithms)

Introduction
Our Contributions.
Related Work.
Notation.
Problem Setup
Proposed Algorithm
Zeroth-order Optimization Method.
Efficient Initialization.
Main Results
Least-Squares Regression for Parameters Recovery
Landscape and Convergence Analysis
Numerical Experiments
Model and Parameter Setup
Evaluation
Discussion
...and 5 more sections

Key Result

Proposition 4.6

Suppose the assumptions on the feature map and on the initial distribution (source labels ass:feature and ass:initial_distr) hold. For any $\nu \in (0, 1)$, $\Phi_{N}^{\top}\Phi_{N}$ is invertible for all $N \gtrsim n+p+d$ with probability at least $1 - \nu$. Consequently, $\widehat{\Theta} = \Theta$ with probability at least $1 - \nu$.
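The exact-recovery claim can be illustrated with a small numerical sketch. It assumes noiseless one-step transitions of the form x' = A x + B u + C phi(x), with n, p, and d read as the state, control, and feature dimensions, and Theta stacked as [A, B, C]; these conventions, the tanh feature map, the Gaussian sampling, and the function name recover_parameters are assumptions for illustration, not taken from the paper. When the design matrix has full column rank, least squares returns the true parameters up to floating-point error.

```python
import numpy as np

def recover_parameters(n=3, p=2, d=2, N=20, seed=0):
    """Least-squares recovery of Theta = [A, B, C] from N noiseless transitions."""
    rng = np.random.default_rng(seed)
    # Hypothetical ground-truth system (illustration only).
    A = 0.5 * rng.standard_normal((n, n))
    B = rng.standard_normal((n, p))
    C = 0.1 * rng.standard_normal((n, d))
    W = rng.standard_normal((d, n))

    X0 = rng.standard_normal((N, n))  # sampled initial states
    U0 = rng.standard_normal((N, p))  # sampled exploratory controls
    F0 = np.tanh(X0 @ W.T)            # assumed feature map phi(x) = tanh(W x)

    # Design matrix with one row [x, u, phi(x)] per sample: shape N x (n + p + d).
    Phi = np.hstack([X0, U0, F0])
    # Noiseless next states under the assumed dynamics.
    X1 = X0 @ A.T + U0 @ B.T + F0 @ C.T

    # Solve min_Theta ||Phi Theta^T - X1||_F; exact if Phi^T Phi is invertible.
    theta_hat_t, *_ = np.linalg.lstsq(Phi, X1, rcond=None)
    Theta = np.hstack([A, B, C])
    return Theta, theta_hat_t.T, Phi

if __name__ == "__main__":
    Theta, Theta_hat, Phi = recover_parameters()
    print("max abs error:", np.max(np.abs(Theta - Theta_hat)))
```

Because the transitions carry no noise, the number of samples only needs to be large enough for the design matrix to have full column rank, which matches the $N \gtrsim n+p+d$ scaling in the statement.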

Figures (2)

Figure 1: Convergence of the policy gradient algorithm.
Figure 2: Robustness of the policy gradient algorithm.

Theorems & Definitions (31)

Example 4.2
Proof of Example 4.2
Proposition 4.6
Proof of Proposition 4.6
Theorem 4.7
Proof of Theorem 4.7
Theorem 4.8
Proof of Theorem 4.8
Lemma 6.1 (vershynin2010introduction)
Lemma 6.2
...and 21 more
