Model-Agnostic Zeroth-Order Policy Optimization for Meta-Learning of Ergodic Linear Quadratic Regulators

Yunian Pan; Quanyan Zhu

TL;DR

This work tackles meta-learning for a family of ergodic LQR tasks under unknown dynamics by introducing a model-agnostic, zeroth-order meta-gradient approach. The method formulates a meta-objective ℒ(K) that evaluates performance after a one-step policy update, and it uses a Monte-Carlo zeroth-order estimator to approximate the meta-gradient ∇ℒ(K) without explicitly computing the policy Hessian. The authors prove boundedness and Lipschitz properties for ∇ℒ(K) and establish convergence of exact gradient descent on the meta-objective under a suitable step size. They acknowledge that the practical meta-gradient estimate is biased and that its sample complexity needs further analysis. Numerical experiments on a small ensemble of similar LQRs show that the approach reduces the average-cost gap to the optimum across tasks, which supports its potential for rapid adaptation in control problems with varying dynamics. Overall, the paper offers a Hessian-free framework for cross-task adaptation in LQR settings, with potential applications in robust and adaptive control.
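The Monte-Carlo zeroth-order estimator mentioned above can be illustrated with a minimal sketch. It assumes the common one-point form, in which the gradient of a smoothed cost is approximated by (dim / r^2) times the average of f(x + U) U over perturbations U drawn uniformly from the sphere of radius r. The helper names, the quadratic test cost, and the sample count are hypothetical choices for illustration, not the paper's algorithm or settings.

```python
import math
import random


def sample_sphere(dim, radius, rng):
    """Draw a point uniformly from the sphere of the given radius in R^dim."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in g))
    return [radius * v / norm for v in g]


def zeroth_order_gradient(f, x, radius, num_samples, rng):
    """One-point Monte Carlo estimate of the gradient of the smoothed f at x.

    Uses (dim / radius^2) * mean(f(x + U) * U) with U uniform on the sphere.
    Only function values of f are queried; no derivatives are needed.
    """
    dim = len(x)
    acc = [0.0] * dim
    for _ in range(num_samples):
        u = sample_sphere(dim, radius, rng)
        value = f([xi + ui for xi, ui in zip(x, u)])
        for j in range(dim):
            acc[j] += value * u[j]
    scale = dim / (num_samples * radius ** 2)
    return [scale * a for a in acc]


def quadratic(x):
    # Hypothetical test cost with true gradient 2 * (x - c), where c = (1, -2).
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2


rng = random.Random(0)  # fixed seed so the run is reproducible
estimate = zeroth_order_gradient(quadratic, [0.0, 0.0], 0.5, 20000, rng)
true_gradient = [-2.0, 4.0]
print(estimate, true_gradient)
```

The one-point estimator has high variance (each sample scales like f(x) / r), which is why many perturbations are averaged. In the meta-learning setting the same idea would be applied with f replaced by the meta-objective, so that neither the policy gradient nor the policy Hessian has to be formed explicitly.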

Abstract

Meta-learning has been proposed as a promising machine learning topic in recent years, with important applications to image classification, robotics, computer games, and control systems. In this paper, we study the problem of using meta-learning to deal with uncertainty and heterogeneity in ergodic linear quadratic regulators. We integrate the zeroth-order optimization technique with a typical meta-learning method, proposing an algorithm that omits the estimation of the policy Hessian and applies to the task of learning a set of heterogeneous but similar linear dynamic systems. The induced meta-objective function inherits important properties of the original cost function when the set of linear dynamic systems is meta-learnable, allowing the algorithm to optimize over a learnable landscape without projection onto the feasible set. We provide a convergence result for the exact gradient descent process by analyzing the boundedness and smoothness of the gradient of the meta-objective, which justifies the proposed algorithm when the gradient estimation error is small. We also provide a numerical example to corroborate this perspective.
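To make the meta-objective concrete, the following sketch runs gradient descent on it for three hypothetical scalar LQR tasks. It assumes dynamics x_{t+1} = a x_t + b u_t + w_t with noise variance psi, the feedback convention u = -k x, and stage cost q x^2 + r u^2, so the average cost has a closed form. The meta-objective is taken to be the mean over tasks of J_i(k - eta * grad J_i(k)), and a central finite difference stands in for the exact meta-gradient. The task parameters and step sizes are illustrative assumptions, not values from the paper.

```python
def avg_cost(k, task):
    """Closed-form average cost of a scalar ergodic LQR under u = -k x."""
    a, b, q, r, psi = task
    c = a - b * k  # closed-loop pole
    if abs(c) >= 1.0:
        return float("inf")  # k is outside the stabilizing (feasible) set
    p = (q + r * k * k) / (1.0 - c * c)  # scalar Bellman solution
    return p * psi


def cost_gradient(k, task):
    """Policy gradient 2 * E * Sigma in the scalar case (assumed convention)."""
    a, b, q, r, psi = task
    c = a - b * k
    p = (q + r * k * k) / (1.0 - c * c)
    sigma = psi / (1.0 - c * c)  # stationary state variance
    e = (r + b * b * p) * k - b * p * a
    return 2.0 * e * sigma


def meta_objective(k, tasks, eta):
    """Mean cost across tasks after one inner gradient step from k."""
    adapted = [avg_cost(k - eta * cost_gradient(k, t), t) for t in tasks]
    return sum(adapted) / len(tasks)


def meta_gradient(k, tasks, eta, h=1e-5):
    """Central finite difference, a stand-in for the exact meta-gradient."""
    up = meta_objective(k + h, tasks, eta)
    down = meta_objective(k - h, tasks, eta)
    return (up - down) / (2.0 * h)


# Hypothetical similar tasks: (a, b, q, r, psi).
TASKS = [(0.9, 1.0, 1.0, 1.0, 1.0),
         (1.0, 0.9, 1.0, 1.0, 1.0),
         (1.1, 1.0, 1.0, 1.0, 1.0)]
ETA, ALPHA = 0.01, 0.05  # inner and meta step sizes (illustrative)

k = 0.5  # stabilizing initial gain for all three tasks
history = [meta_objective(k, TASKS, ETA)]
for _ in range(100):
    k -= ALPHA * meta_gradient(k, TASKS, ETA)
    history.append(meta_objective(k, TASKS, ETA))
final_grad = meta_gradient(k, TASKS, ETA)
print(k, history[0], history[-1], final_grad)
```

No projection step is used here: the iterates stay inside the stabilizing set because the cost grows without bound near its boundary and the step size is small. This mirrors, in a toy setting, the landscape property the abstract refers to.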

Paper Structure (12 sections, 7 theorems, 39 equations, 1 figure, 3 algorithms)

INTRODUCTION
RELATED WORK
PROBLEM FORMULATION
Policy Optimization for LQR
Zero-th Order Method for Gradient-Estimation
The Meta-Learning Problem
Zero-th Order Method for Meta-Gradient Estimation
GRADIENT DESCENT ANALYSIS
On the boundedness and Lipschitz Property of $\nabla \mathcal{L}(K)$
Convergence of Exact Gradient Descent
Numerical Results
CONCLUSIONS

Key Result

Proposition 1

The average cost is $J_i(K) = \operatorname{Tr}(P^i_K \Psi_i)$. The gradient $\nabla J_i(K)$ is expressed in terms of $\Sigma^i_{K}$, which satisfies the Gramian equation, the matrix $E^i_K$, and $P^i_K$, the unique positive definite solution to the Bellman equation.
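Proposition 1 can be checked numerically in the scalar case. The sketch below assumes dynamics x_{t+1} = a x_t + b u_t + w_t with noise variance psi, feedback u = -k x, and stage cost q x^2 + r u^2. It also assumes the standard forms from the LQR policy-gradient literature: the Bellman equation P = q + r k^2 + (a - b k)^2 P, the Gramian equation Sigma = psi + (a - b k)^2 Sigma, E = (r + b^2 P) k - b P a, and gradient 2 E Sigma. These explicit forms and the parameter values are assumptions for illustration; the paper's own sign convention may differ.

```python
def solve_bellman(k, a, b, q, r, iters=500):
    """Fixed-point iteration for the scalar Bellman equation."""
    c = a - b * k
    if abs(c) >= 1.0:
        raise ValueError("gain k does not stabilize the system")
    p = 0.0
    for _ in range(iters):
        p = q + r * k * k + c * c * p
    return p


def solve_gramian(k, a, b, psi, iters=500):
    """Fixed-point iteration for the stationary state variance."""
    c = a - b * k
    if abs(c) >= 1.0:
        raise ValueError("gain k does not stabilize the system")
    sigma = 0.0
    for _ in range(iters):
        sigma = psi + c * c * sigma
    return sigma


def average_cost(k, a, b, q, r, psi):
    """Scalar instance of J(K) = Tr(P_K Psi)."""
    return solve_bellman(k, a, b, q, r) * psi


def policy_gradient(k, a, b, q, r, psi):
    """Scalar instance of the assumed gradient form 2 * E_K * Sigma_K."""
    p = solve_bellman(k, a, b, q, r)
    sigma = solve_gramian(k, a, b, psi)
    e = (r + b * b * p) * k - b * p * a
    return 2.0 * e * sigma


# Hypothetical parameters: a = b = q = r = psi = 1 and gain k = 0.5.
params = dict(a=1.0, b=1.0, q=1.0, r=1.0, psi=1.0)
k = 0.5
cost = average_cost(k, **params)
grad = policy_gradient(k, **params)

# Cross-check the gradient formula against a central finite difference.
h = 1e-6
fd_grad = (average_cost(k + h, **params) - average_cost(k - h, **params)) / (2 * h)
print(cost, grad, fd_grad)
```

With these values the closed-loop pole is 0.5, so P = 1.25 / 0.75 = 5/3 and Sigma = 4/3, giving a cost of 5/3 and a gradient of 2 * (-1/3) * (4/3) = -8/9. The negative sign indicates that increasing the gain from 0.5 lowers the average cost.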

Figures (1)

Figure 1: Average performance during gradient descent for three settings of the state and action space dimensions (green: $d = 20, k = 10$; orange: $d = 2, k = 2$; blue: $d = 1, k = 1$). The constant learning rates are $\alpha = 10^{-3}$, $\eta = 10^{-5}$ for the orange and blue curves and $\alpha = 10^{-5}$, $\eta = 10^{-7}$ for the green curve. The numbers of meta and inner perturbations are $D = 100$ and $M = 100$, the gradient smoothing parameter is $r = 0.05$, and the rollout length is $\ell = 50$.

Theorems & Definitions (8)

Proposition 1: Policy Gradient of Ergodic LQR [DBLP:journals/corr/abs-1907-06246]
Lemma 1: Perturbation analysis, adapted from [fazel2018global]
Definition 1: First-order Stationary Point
Lemma 2
Lemma 3
Corollary 1
Lemma 4: Perturbation analysis of $\nabla \mathcal{L}(K)$
Theorem 1
