Policy Gradient Converges to the Globally Optimal Policy for Nearly Linear-Quadratic Regulators
Yinbin Han, Meisam Razaviyayn, Renyuan Xu
TL;DR
This work studies learning optimal control for systems whose dynamics are linear plus a small nonlinear kernel term, under the policy class $u_t = -K_1 x_t - K_2 \phi(x_t)$. It develops a zeroth-order policy gradient method with an initialization built from a linear-quadratic surrogate and least-squares parameter recovery, and proves that the cost is locally strongly convex and smooth in a neighborhood of the global optimum $K^*$. The authors establish finite-sample guarantees: exact recovery of the system parameters via least squares, local landscape properties, and convergence of the zeroth-order algorithm to $K^*$ at a linear rate when the nonlinearities have small Lipschitz constants. They validate the approach with numerical experiments on synthetic near-LQR systems, demonstrating fast convergence, robustness to initialization, and resilience to moderate nonlinearities. Overall, the paper provides a rigorous route to globally optimal policy learning for nearly linear-quadratic regulators in model-free settings, with insights into initialization, sample efficiency, and the role of kernel-based nonlinearities.
Abstract
Nonlinear control systems with partial information to the decision maker are prevalent in a variety of applications. As a step toward studying such nonlinear systems, this work explores reinforcement learning methods for finding the optimal policy in nearly linear-quadratic regulator systems. In particular, we consider a dynamic system that combines linear and nonlinear components and is governed by a policy with the same structure. Assuming that the nonlinear component comprises kernels with small Lipschitz coefficients, we characterize the optimization landscape of the cost function. Although the cost function is nonconvex in general, we establish local strong convexity and smoothness in the vicinity of the global optimizer. Additionally, we propose an initialization mechanism to leverage these properties. Building on these developments, we design a policy gradient algorithm that is guaranteed to converge to the globally optimal policy at a linear rate.
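To make the policy class and the zeroth-order update concrete, the following is a minimal sketch and not the authors' algorithm. Everything not stated above is an assumption made for illustration: the dynamics form $x_{t+1} = A x_t + B u_t + C \phi(x_t)$, the finite-horizon undiscounted cost, the feature map `phi`, the matrices, the two-point gradient estimator, the step size, and the zero initialization (used here in place of the surrogate-based initialization the paper proposes).

```python
import numpy as np

def phi(x):
    # Hypothetical feature map with Lipschitz constant 0.1 (an assumption).
    return 0.1 * np.tanh(x)

def cost(K, A, B, C, Q, R, x0s, horizon):
    """Average finite-horizon cost of the policy u = -K1 x - K2 phi(x).

    K stacks the two gains side by side: K = [K1, K2].
    Assumed dynamics: x_next = A x + B u + C phi(x).
    """
    n = A.shape[0]
    K1, K2 = K[:, :n], K[:, n:]
    total = 0.0
    for x0 in x0s:
        x = x0.copy()
        for _ in range(horizon):
            f = phi(x)
            u = -K1 @ x - K2 @ f
            total += x @ Q @ x + u @ R @ u
            x = A @ x + B @ u + C @ f
    return total / len(x0s)

def zeroth_order_grad(J, K, radius, num_dirs, rng):
    """Two-point zeroth-order estimate of the gradient of J at K."""
    d = K.size
    grad = np.zeros_like(K)
    for _ in range(num_dirs):
        U = rng.standard_normal(K.shape)
        U /= np.linalg.norm(U)  # random direction on the unit Frobenius sphere
        diff = J(K + radius * U) - J(K - radius * U)
        grad += (d / (2.0 * radius)) * diff * U
    return grad / num_dirs

# Illustrative two-state, one-input system (all values are assumptions).
rng = np.random.default_rng(0)
A = np.array([[0.5, 0.1], [0.0, 0.4]])
B = np.array([[0.0], [1.0]])
C = np.eye(2)
Q, R = np.eye(2), np.eye(1)
x0s = np.eye(2)  # two fixed initial states
J = lambda K: cost(K, A, B, C, Q, R, x0s, horizon=20)

K = np.zeros((1, 4))  # zero gain; stabilizing here because A is already stable
initial_cost = J(K)
for _ in range(30):
    K = K - 0.005 * zeroth_order_grad(J, K, radius=0.05, num_dirs=10, rng=rng)
final_cost = J(K)
print(initial_cost, final_cost)
```

The estimator queries only cost values, never derivatives of the dynamics, which is what makes the update model-free. The two-point form is chosen here for lower variance; a one-point estimator would fit the same loop.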
