Convergence Guarantees of Model-free Policy Gradient Methods for LQR with Stochastic Data

Bowen Song, Andrea Iannelli

Abstract

Policy gradient (PG) methods are the backbone of many reinforcement learning algorithms owing to their strong performance in policy optimization problems. As gradient-based approaches, PG methods typically rely on knowledge of the system dynamics. If this knowledge is not available, trajectory data can be used to approximate first-order information. When the data are noisy, the gradient estimates become inaccurate, and a study of how this uncertainty can be estimated and how it propagates through the algorithm is currently missing. To address this, our work focuses on the Linear Quadratic Regulator (LQR) problem for systems subject to additive stochastic noise. After briefly summarizing the state of the art for the case of a known model, we focus on scenarios where the system dynamics are unknown and approximate gradient information is obtained using zeroth-order optimization techniques. We analyze the theoretical properties by computing the error in the estimated gradient and examining how this error affects the convergence of PG algorithms. Additionally, we provide global convergence guarantees for several versions of PG methods, including those employing adaptive step sizes and variance reduction techniques, which help increase the convergence rate and reduce sample complexity. This study contributes to characterizing the robustness of model-free PG methods, aiming to identify their limitations in the presence of stochastic noise and to propose improvements that enhance their applicability.
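To make the zeroth-order idea in the abstract concrete, the following is a minimal sketch of a one-point smoothed gradient estimate built only from noisy closed-loop rollouts. It is an illustration under stated assumptions, not the paper's algorithm: the control law $u_t = -K x_t$, the finite-horizon average cost, the isotropic Gaussian noise, and all names (`rollout_cost`, `zeroth_order_gradient`, `radius`, `num_samples`, the example matrices) are hypothetical choices made for this example.

```python
import numpy as np

def rollout_cost(A, B, Q, R, K, sigma_w, horizon, rng):
    """Average cost of one noisy closed-loop trajectory under u = -K x (assumed setup)."""
    n = A.shape[0]
    x = np.zeros(n)
    total = 0.0
    for _ in range(horizon):
        u = -K @ x
        total += x @ Q @ x + u @ R @ u
        w = sigma_w * rng.standard_normal(n)  # additive stochastic noise
        x = A @ x + B @ u + w
    return total / horizon

def sample_unit_sphere(shape, rng):
    """Random perturbation direction with unit Frobenius norm."""
    U = rng.standard_normal(shape)
    return U / np.linalg.norm(U)

def zeroth_order_gradient(A, B, Q, R, K, sigma_w, radius, num_samples, horizon, rng):
    """One-point smoothed estimate: average of (d / radius) * cost(K + radius * U) * U."""
    d = K.size
    grad = np.zeros_like(K)
    for _ in range(num_samples):
        U = sample_unit_sphere(K.shape, rng)
        cost = rollout_cost(A, B, Q, R, K + radius * U, sigma_w, horizon, rng)
        grad += (d / radius) * cost * U
    return grad / num_samples

# Hypothetical two-state, one-input example; the gain K is stabilizing for this (A, B).
rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)
K = np.array([[0.1, 0.2]])
g = zeroth_order_gradient(A, B, Q, R, K, sigma_w=0.1, radius=0.05,
                          num_samples=200, horizon=100, rng=rng)
```

The estimate uses only cost evaluations, never the matrices inside the gradient formula; `A` and `B` appear solely to simulate data. One-point estimates of this kind tend to have high variance, since each sample is scaled by `d / radius`, and both the rollout noise and the finite horizon add further error to every cost evaluation.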

Paper Structure

This paper contains 40 sections, 18 theorems, 203 equations, 4 figures, 5 algorithms.

Key Result

Lemma 1

The function $C$ is gradient dominated on the set $\mathcal{S}$. That is, for any $K \in \mathcal{S}$, the inequality $C(K) - C(K^*) \le \mu \lVert \nabla C(K) \rVert_F^2$ holds, with $\mu:=\frac{1}{4}\lVert \Sigma_{K^*} \rVert\lVert \Sigma_w ^{-2}\rVert\lVert R ^{-1}\rVert$.
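Gradient domination can be checked numerically when the model is known. The sketch below assumes the inequality takes the standard form $C(K) - C(K^*) \le \mu \lVert \nabla C(K) \rVert_F^2$, and it assumes the usual average-cost LQR expressions for $u_t = -K x_t$: $C(K) = \mathrm{Tr}(P_K \Sigma_w)$ and $\nabla C(K) = 2\big((R + B^\top P_K B)K - B^\top P_K A\big)\Sigma_K$, with $P_K$ and $\Sigma_K$ solving discrete Lyapunov equations. The helper names and the example system are hypothetical and not taken from the paper.

```python
import numpy as np

def dlyap(F, W):
    """Solve X = F X F^T + W by vectorization (adequate for small systems)."""
    n = F.shape[0]
    vec_x = np.linalg.solve(np.eye(n * n) - np.kron(F, F), W.flatten())
    return vec_x.reshape(n, n)

def lqr_quantities(A, B, Q, R, Sigma_w, K):
    """Cost, gradient and stationary covariance for u = -K x (assumed formulas)."""
    F = A - B @ K
    P = dlyap(F.T, Q + K.T @ R @ K)   # P = F^T P F + Q + K^T R K
    Sigma = dlyap(F, Sigma_w)         # Sigma = F Sigma F^T + Sigma_w
    cost = np.trace(P @ Sigma_w)
    grad = 2.0 * ((R + B.T @ P @ B) @ K - B.T @ P @ A) @ Sigma
    return cost, grad, Sigma

def optimal_gain(A, B, Q, R, iters=2000):
    """Optimal gain K* obtained by iterating the Riccati recursion."""
    P = Q.copy()
    for _ in range(iters):
        G = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ A - A.T @ P @ B @ G
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# Hypothetical example system; A is stable, so K = 0 is a stabilizing gain.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R, Sigma_w = np.eye(2), np.eye(1), 0.1 * np.eye(2)

K_star = optimal_gain(A, B, Q, R)
cost_star, _, Sigma_star = lqr_quantities(A, B, Q, R, Sigma_w, K_star)
mu = (0.25 * np.linalg.norm(Sigma_star, 2)
      * np.linalg.norm(np.linalg.inv(Sigma_w @ Sigma_w), 2)
      * np.linalg.norm(np.linalg.inv(R), 2))

candidates = [np.zeros((1, 2)), K_star, K_star + np.array([[0.1, -0.1]])]
results = []
for K in candidates:
    # Keep only stabilizing gains, i.e. spectral radius of A - B K below one.
    if np.max(np.abs(np.linalg.eigvals(A - B @ K))) < 1.0:
        cost, grad, _ = lqr_quantities(A, B, Q, R, Sigma_w, K)
        results.append((cost - cost_star, mu * np.linalg.norm(grad, "fro") ** 2))
```

Each entry of `results` pairs the optimality gap with the bound $\mu \lVert \nabla C(K) \rVert_F^2$; under the stated assumptions the first value should not exceed the second for any stabilizing gain.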

Figures (4)

  • Figure 1: Performance of Gradient Descent with inexact noisy gradient for different values of step size and noise
  • Figure 2: Effect of noise level on step size
  • Figure 3: PGD with and without variance reduction
  • Figure 4: NPG with and without adaptive step size

Theorems & Definitions (36)

  • Remark 1: Weighting matrices $Q,R$
  • Lemma 1: Gradient Domination
  • Lemma 2: Almost Smoothness on $\mathcal{S}$
  • Lemma 3: $\Sigma_{K}$ Perturbation
  • Lemma 4: $C$ Perturbation
  • Lemma 5: $\nabla C$ Perturbation
  • Theorem 1: Policy gradient descent with Adaptive Step Size
  • Remark 2: Advantage of Adaptive Step Size over Fixed Step Size
  • Theorem 2: Natural Policy Gradient with Adaptive Step Size
  • Theorem 3: Gauss-Newton Method
  • ...and 26 more