Optimal Control of Nonlinear Systems with Unknown Dynamics

Wenjian Hao; Paulo C. Heredia; Shaoshuai Mou

TL;DR

The paper addresses optimal control of systems with unknown nonlinear dynamics by combining a deep Koopman operator (DKO) lifting with an actor–critic policy gradient framework, yielding a data-driven method that synthesizes a closed-loop controller without explicit model knowledge. The proposed PGDK method jointly learns the lifted dynamics, a temporal-difference (TD) critic, and a policy, so that the policy parameters $\boldsymbol{\theta}^{\mu}$ can be optimized by gradient descent using one-step predictions and data from $\mathcal{D}$. The authors provide convergence analyses under Robbins–Monro step sizes and timescale separation, showing convergence to a global optimum in convex settings and local convergence otherwise, along with robustness to gradient approximation errors. Simulations on an LTI system and a nonlinear inverted pendulum show improved data efficiency and performance close to model-based baselines such as LQR and MPC, and indicate that online PGDK can be more sample-efficient than some off-policy RL baselines.
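
For readers unfamiliar with the term, the Robbins–Monro step-size conditions referred to above are usually stated as follows. Here $\alpha_k$ is a generic step size and $c > 0$ a generic constant; both are notation assumed for this illustration, not symbols taken from the paper's theorems.

$$
\sum_{k=0}^{\infty} \alpha_k = \infty, \qquad \sum_{k=0}^{\infty} \alpha_k^2 < \infty .
$$

A schedule of the form $\alpha_k = \frac{c}{k+2}$ satisfies both conditions: the first sum is a scaled harmonic series and diverges, while the second is bounded by $c^2 \sum_{k \ge 0} \frac{1}{(k+2)^2} < c^2 \frac{\pi^2}{6}$. Timescale separation, in the usual two-timescale stochastic approximation sense, additionally asks that the ratio of the slow step size to the fast one tend to zero, so that the faster iterate appears nearly converged from the point of view of the slower one. Which of the paper's three learners runs on which timescale is not stated in this summary.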

Abstract

This paper presents a data-driven method to find a closed-loop optimal controller, which minimizes a specified infinite-horizon cost function for systems with unknown dynamics. Suppose the closed-loop optimal controller can be parameterized by a given class of functions, hereafter referred to as the policy. The proposed method introduces a novel gradient estimation framework, which approximates the gradient of the cost function with respect to the policy parameters by integrating the Koopman operator with the classical concept of actor-critic. This enables the policy parameters to be tuned iteratively using gradient descent to achieve an optimal controller, leveraging the linearity of the Koopman operator. A convergence analysis of the proposed framework is provided. The control performance of the proposed method is evaluated through simulations and compared with classical optimal control methods that usually assume the dynamics are known.
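
To make the idea of "leveraging the linearity of the Koopman operator" concrete, the following is a minimal sketch, not the authors' PGDK implementation. It assumes a fixed, hand-picked lifting (the state and its square) instead of a learned deep network, fits lifted linear dynamics $z_{t+1} \approx A z_t + B u_t$ by ordinary least squares, and uses them for one-step prediction. The toy system, the function names `lift`, `fit_lifted_linear_model`, and `predict_one_step`, and all constants are hypothetical choices made for illustration.

```python
import numpy as np

def lift(x):
    """Hypothetical fixed dictionary: the state itself followed by its squares."""
    x = np.atleast_2d(x)
    return np.hstack([x, x**2])

def fit_lifted_linear_model(X, U, X_next):
    """Least-squares fit of z' ~ A z + B u, where z = lift(x)."""
    Z, Z_next = lift(X), lift(X_next)
    ZU = np.hstack([Z, U])                            # regressors [z, u]
    W, *_ = np.linalg.lstsq(ZU, Z_next, rcond=None)   # Z_next ~ ZU @ W
    n_z = Z.shape[1]
    A = W[:n_z].T                                     # lifted state matrix
    B = W[n_z:].T                                     # lifted input matrix
    return A, B

def predict_one_step(A, B, x, u):
    """Predict the next state by stepping in lifted space and reading back x."""
    z = lift(x)[0]
    z_next = A @ z + B @ np.atleast_1d(u)
    return z_next[: np.size(x)]   # the first block of the dictionary is the state

# Toy data from an assumed scalar nonlinear system (unknown to the learner):
#   x' = 0.9 x - 0.05 x^2 + 0.1 u
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
U = rng.uniform(-1.0, 1.0, size=(200, 1))
X_next = 0.9 * X - 0.05 * X**2 + 0.1 * U

A, B = fit_lifted_linear_model(X, U, X_next)
x_pred = predict_one_step(A, B, np.array([0.5]), 0.2)
x_true = 0.9 * 0.5 - 0.05 * 0.5**2 + 0.1 * 0.2
print("predicted:", x_pred, "true:", x_true)
```

In this toy case the state update is exactly linear in the chosen dictionary and the input, so the first row of the fitted model reproduces it; the second lifted coordinate ($x^2$) is only approximated. With a learned lifting, as in a deep Koopman approach, the dictionary itself would be trained rather than fixed.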

Paper Structure (18 sections, 3 theorems, 52 equations, 7 figures, 2 algorithms)

Introduction
The Problem
Main Results
Challenges and Key Ideas
The Proposed Framework
Analysis
Numerical Simulations
Online Implementation
Numerical Simulations
LTI System
Simulated Inverted Pendulum
Concluding Remarks
Appendix
Proof of Lemma \ref{thm1}
Proof of Theorem \ref{thm2}
...and 3 more sections

Key Result

Lemma 1

If Assumption asp1 holds and $\boldsymbol{\theta}_k^f$ is updated following eq_gd_thetaf with step size $\alpha_k^f = \frac{1}{L_{f1}(2+k)}$, then [display equation missing in the source], where $\nu_f = \max\{L_{f2}/L_{f1}^2,\ 2\|\boldsymbol{\theta}_0^f-\boldsymbol{\theta}^{f*}\|^2\}$, $L_{f1}$ is a constant, and $L_{f2}$ is a constant determined by the residual vectors from the least squares solutions in lmn. Furthermore, [statement truncated in the source].

Figures (7)

Figure 1: PGDK framework for policy gradient estimation.
Figure 2: PGDK framework under online implementation.
Figure 3: Learning and testing stage cost using the same initial states. Here, $\bar{c}$ denotes the averaged stage cost over each episode to account for the variance of different initial states. The solid line represents the mean stage cost across $5$ experiment trials, while the shaded region and error bars indicate the standard deviation.
Figure 4: Trajectories from PGDK and LQR.
Figure 5: Learning losses of the online (blue) and offline (black) PGDK, where $L_f$, $L_J$, and $\hat{J}$ represent the loss functions computed over the entire dataset at iteration $k$, while $L_k^f$, $L_k^J$, and $\hat{J}_k^\mu$ denote their corresponding values computed using the sampled data batches.
...and 2 more figures

Theorems & Definitions (5)

Remark 1
Definition 1
Lemma 1
Lemma 2
Theorem 1
